APPEAL BRIEF

Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent
Appeals and Interferences ("the Board") in response to the Final Official Action issued on December 18,
2002 (Paper No. 17). The Notice of Appeal was timely submitted on March 13, 2003, and was received in
the Patent and Trademark Office ("the Office") on March 18, 2003. This Appeal Brief is timely submitted
in light of the concurrently filed Petition for an Extension of Time of four months to and including
September 1, 2003 and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(4) from
Appellants' Representatives' deposit account. The Commissioner is also authorized to charge the fee for
filing this Appeal Brief ($160.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics
Incorporated Deposit Account No. 50-0892.

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the
extension of time are due in connection with this Appeal Brief. However, should any additional fees under
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated
Deposit Account No. 50-0892.

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest 
Place, The Woodlands, Texas, 77381. 

II. RELATED APPEALS AND INTERFERENCES 
Appellants know of no related appeals or interferences. 

III. STATUS OF THE CLAIMS 

The present application was filed on October 31, 2000, claiming the benefit of U.S. Provisional
Application Number 60/163,018, which was filed on November 2, 1999, and included original claims
1-4.

The Examiner issued a Restriction and Election Requirement separating the original claims into three
separate and distinct inventions, and in a telephone conversation Appellants elected Group I (claims
1-2), with traverse, for prosecution on the merits.

A First Official Action was issued on December 19, 2001 ("the First Action"; Paper No. 11).
Claims 1-2 were rejected under 35 U.S.C. § 101, due to the alleged lack of patentable utility. Claims
1-2 were also rejected under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled
artisan due to the alleged lack of patentable utility. Claim 2 was rejected under 35 U.S.C. § 112,
second paragraph, as allegedly indefinite. Claim 1 was rejected under 35 U.S.C. § 102(b) as allegedly
being anticipated, and claims 3-4 were withdrawn from further consideration by the Examiner as being
drawn to a non-elected invention.

In a response to the First Official Action, submitted to the Office on April 19, 2002
("response to the First Action"), Appellants acknowledged the Restriction and Election Response and
amended claim 2 to further improve its clarity.

A Second Official Action was issued on July 9, 2002 ("the Second Action"; Paper No. 14). The
rejection of claims 1-2 was maintained under 35 U.S.C. § 101, due to the alleged lack of patentable
utility. The rejection of claims 1-2 was also maintained under 35 U.S.C. § 112, first paragraph, as
allegedly unusable by the skilled artisan due to the alleged lack of patentable utility. The rejection of
claim 2 was maintained under 35 U.S.C. § 112, second paragraph, as allegedly indefinite. The rejection
of claim 1 under 35 U.S.C. § 102(b) was withdrawn, but claim 1 was rejected under 35 U.S.C. § 102(a)
as allegedly being anticipated.

In a response to the Second Official Action, submitted to the Office on November 11, 2002
("response to the Second Action"), Appellants amended claim 1 and added new claims 5-7 to
further improve clarity, and canceled claim 2 without prejudice and without disclaimer.

A third and Final Official Action was issued on December 18, 2002 (the "Final Action"; Paper
No. 17), in which the rejection of claims 1 and 5-7 was maintained under 35 U.S.C. § 101 and
35 U.S.C. § 112, first paragraph, and, in view of Appellants' amendments to the claims, the rejection
of claim 1 under 35 U.S.C. § 102(a) was withdrawn.

In a response to the Final Action, submitted on April 18, 2003 ("response to the Final Action"),
Appellants again addressed the outstanding rejections of claims 1 and 5-7.

An Advisory Action ("the Advisory Action") was mailed on May 5, 2003, maintaining the
rejection of claims 1 and 5-7 under 35 U.S.C. § 101 as allegedly lacking a patentable
utility and under 35 U.S.C. § 112, first paragraph, as one skilled in the art allegedly would not know
how to use the claimed invention. A copy of the appealed claims is included below in the Appendix
(Section IX).

IV. STATUS OF THE AMENDMENTS 

For the purposes of this Appeal, Appellants believe that no outstanding amendments exist.

V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human
sequences that encode a novel isoform of an ATP-binding cassette transporter protein, a class of
proteins that are well known to be involved in mammalian multi-drug resistance (page 2, lines 7-8).
The specification details a number of uses for the presently claimed polynucleotide sequences, including
the detection and diagnosis of human disease (page 12) as well as therapeutically augmenting the
efficacy of chemotherapeutic agents used in the treatment of breast or prostate cancer (page 14, lines
4-6). The sequences of the present invention are noted to be expressed in prostate (page 3, line 10).
Additional uses include assessing temporal and tissue specific gene expression patterns (specification at
page 5, lines 15-18), particularly using a high throughput "chip" format (specification at page 5, lines
19-22), mapping the sequences to a specific region of a human chromosome, and identifying protein
encoding regions and determining the genomic structure (specification at page 8, lines 11-16). A still
further example of utility is the use of the present sequences in such diagnostic assays (at least at page
14, line 1) as those associated with identification of paternity and forensic analysis, among others. The
sequences of the present invention have particular utility as the application as filed identified several
polymorphisms (page 13, lines 16-25).

VI. ISSUES ON APPEAL 

1. Do claims 1 and 5-7 lack a patentable utility?

2. Are claims 1 and 5-7 unusable by a skilled artisan due to a lack of patentable utility? 



VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, 
first paragraph, the claims will stand or fall together. 

VIII. ARGUMENT 

A. Do Claims 1 and 5-7 Lack a Patentable Utility?

The Final Action first rejects claims 1 and 5-7 under 35 U.S.C. § 101, as allegedly lacking a
patentable utility due to not being supported by either a specific and substantial utility or a
well-established utility. This rejection is maintained in the Advisory Action.

Appellants strongly disagree, as the specification details a number of specific and substantial
utilities for the presently claimed polynucleotide sequences, which encode a novel isoform of an
ATP-binding cassette transporter protein, a class of proteins that are well known to be involved in
mammalian multi-drug resistance (page 2, lines 7-8). The specification details a number of uses for the
presently claimed polynucleotide sequences, including the detection and diagnosis of human disease
(page 12) as well as therapeutically augmenting the efficacy of chemotherapeutic agents used in the
treatment of breast or prostate cancer (page 14, lines 4-6). The sequences of the present invention are
noted to be expressed in prostate (page 3, line 10). Additional uses include assessing temporal and
tissue specific gene expression patterns (specification at page 5, lines 15-18), particularly using a high
throughput "chip" format (specification at page 5, lines 19-22), mapping the sequences to a specific
region of a human chromosome, and identifying protein encoding regions and determining the genomic
structure (specification at page 8, lines 11-16). A still further example of utility is the use of the
present sequences in such diagnostic assays (at least at page 14, line 1) as those associated with
identification of paternity and forensic analysis, among others. The sequences of the present invention
have particular utility as the application as filed identified several polymorphisms (page 13, lines 16-25).

Appellants would like to invite the Board's attention to the fact that a sequence sharing 94%
identity at the nucleic acid level with the sequences of the present invention is present in the leading
scientific repository for biological sequence data (GenBank), and has been annotated by third party
scientists wholly unaffiliated with Appellants as ATP-binding cassette, sub-family C, member 11
isoform a; multi-resistance protein 8 (GenBank accession number NP_115972; abstract, alignment
and GenBank report provided in Exhibit A) and as ATP-binding cassette, sub-family C, member 11
isoform b; multi-resistance protein 8 (GenBank accession number NP_660187; abstract, alignment
and GenBank report provided in Exhibit B). The legal test for utility simply involves an assessment of
whether those skilled in the art would find any of the utilities described for the invention to be credible
or believable. Given this GenBank annotation, there can be little question that those skilled in the art
would clearly believe that Appellants' sequence is a novel human isoform of the ATP-binding cassette,
sub-family C, member 11; multi-resistance protein 8. Thus, the present claims clearly meet the
requirements of 35 U.S.C. § 101.

The Advisory Action (at page 2, lines 6-7) states that "post filing references can only be used
to support an asserted utility in the specification. Appellants have only disclosed that in their
specification that the protein of the present invention was believed to be an MDR protein." and that
Appellants did not know the identity of the protein encoded by the sequences of the present
invention, "thereby supporting the Examiner's position that utility was not known at the time of filing."
However, Appellants respectfully submit that the issue with regard to 35 U.S.C. § 101 is one of
utility, not identity or nomenclature, and that the legal test for utility simply involves an assessment of
whether those skilled in the art would find any of the utilities described for the invention to be credible
or believable. The application as filed clearly describes the current invention as a novel human
transporter protein (inter alia, title, page 1) and describes transporter proteins as integral
membrane proteins that mediate or facilitate the passage of materials across the lipid bilayer (page 1,
lines 26-28). It identifies their role as a mechanism of drug resistance, wherein diseased cells use
cellular transporter systems to export chemotherapeutic agents from the cell (page 1, lines 30-33), and
later asserts a utility in augmenting the efficacy of chemotherapeutic agents used in
the treatment of breast or prostate cancer (specification at page 14, lines 8-10).

Appellants have asserted that the present invention is a human transporter protein, and
provided evidence that the sequences of the present invention indeed encode a transporter protein, in
particular, a variant that encodes an isoform of the ATP-binding cassette, sub-family C, member 11;
multi-resistance protein 8. In light of the well-established fact that ATP-binding cassette transporters
are known to the art to be frequently associated with multiple drug resistance by cancer cells and that
mutations in these genes can cause accelerated removal of chemotherapeutic agents, it is clear the
present invention has utility. Appellants have further asserted that similar MDR encoding sequences,
uses, and applications that are germane to the proteins encoded by the sequences of the present
invention were described in issued U.S. Patent Nos. 5,198,344 and 5,866,699, which were
incorporated by reference in their entirety into the present application.

The well-established utility of the class of transporter proteins encoded by the sequences of the
present invention is further evidenced by the NCBI LocusLink summary for the ABCC11 genetic locus:

"The protein encoded by this gene is a member of the superfamily of ATP-binding cassette 
(ABC) transporters. ABC proteins transport various molecules across extra- and intra- 
cellular membranes." 

"This ABC full transporter is a member of the MRP subfamily which is involved in multi- 
drug resistance. It is expressed at low levels in all tissues, except kidney, spleen, and 
colon. This gene and family member ABCC12 are determined to be derived by duplication
and are both localized to chromosome 16q12.1. Their chromosomal localization, potential
function, and expression patterns identify them as candidates for paroxysmal kinesigenic
choreoathetosis, a disorder characterized by attacks of involuntary movements and postures,
chorea, and dystonia. Multiple alternatively spliced transcript variants have been described
for this gene."

(http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=85320)

Clearly, ABCC11 transporter proteins, and thus logically the sequences of the present
invention which encode an ABCC11 transporter protein isoform, have a very well-established utility
that is readily recognized by those of skill in the art.

In the Second Official Action (Paper No. 14), the references Tammur et al. (Exhibit C) and
Yabuuchi et al. (Exhibit D) are used to attempt to discredit Appellants' assertion of utility. However,
these publications support rather than dispute Appellants' assertion that the present invention has utility
and is a splice variant of ABCC11. For example, Tammur et al., in the final paragraph of the
introduction (page 90, 4th paragraph), state that they had undertaken a long-term project of cloning
new human ABC transporters and linking them to various disease phenotypes and have identified
ABCC11 and ABCC12 as two such members. Thus, clearly, Tammur et al. recognize the value and
utility of ABCC11 and ABCC12 and their association with human diseases. In addition, with regard to
function, Tammur et al. state on page 93, lines 8-10, that "it would be reasonable to suggest that
ABCC11 and ABCC12 could share functional similarities with ABCC4 and ABCC5." That function
is recognized by the art as the transport of organic anions, nucleotide analogs and cyclic
nucleotides. Thus, rather than contradicting the utility of the present invention, the conclusions of
Tammur et al. support the position that those of skill in the art would recognize Appellants' asserted
utility of the present invention as credible.

Yabuuchi et al. clearly support Appellants' assertion that the present invention is a splice
variant of ABCC11, for there appear to be many such variants. Although these authors speculate
that "splice variants may represent diverse biological functions" (emphasis added), this speculation is
not supported by any data or based on any fact or reference and thus appears to be pure speculation:
"Therefore, it is of interest to know whether some of these splice variants... represent biological
functions" (pg 937, lines 17-19). However, Yabuuchi et al. also recognize in their concluding remarks
the utility of ABCC11 with regard to human disease and therefore also indicate that Appellants' utility
assertions are credible. Further recognition of the utility of ABCC11 sequences is provided by other
scientific publications, such as that of Turriziani et al. (Impaired 2',3'-dideoxy-3'-thiacytidine
accumulation in T-lymphoblastoid cells as a mechanism of acquired resistance independent of multidrug
resistant protein 4 with a possible role for ATP-binding cassette C11, Biochem. J. 368, 325-332,
2002; Exhibit E). Turriziani et al. describe the finding that increased expression of ATP-binding
cassette C11 (ABCC11) was observed in the CEM 3TC cells and that the decreased 3TC
accumulation in the CEM 3TC cells might be due to the upregulation of ABCC11.

Clearly, the evidence supports Appellants' assertions that the sequences of the present invention,
which encode a novel human transporter protein (an isoform of the ATP-binding cassette, sub-family
C, member 11; multi-resistance protein 8), have a well-established utility that is recognized by those of
skill in the art.

Furthermore, this situation parallels Example 10 of the PTO's Revised Interim Utility
Guidelines Training Materials (pages 53-55), which establishes that a rejection under 35 U.S.C. § 101
as allegedly lacking a patentable utility and under 35 U.S.C. § 112, first paragraph, as allegedly
unusable by the skilled artisan due to the alleged lack of patentable utility, is not proper when there is no
reason to doubt the asserted utility of a full length sequence (such as the presently claimed sequence)
that has a similarity to a protein having a known function. The Analysis portion of Example 10
states that "Based on applicant's disclosure and the results of the PTO search, there is no reason to
doubt the assertion that SEQ ID NO:2 encodes a DNA ligase. Further DNA ligases have a
well-established use in the molecular biology art based on this class of protein's ability to ligate DNA.
Note that if there is a well-established utility already associated with the claimed invention, the
utility need not be asserted in the specification as filed... Thus the conclusion reached from this
analysis is that a 35 U.S.C. § 101 and a 35 U.S.C. § 112 first paragraph, utility rejection should not be
made."

The present case is similar to that presented in Example 10 of the Revised Interim Utility
Guidelines Training Materials (pages 53-55). In the present case it is clear that the sequences of the
present invention encode an ATP-binding cassette (ABC) transporter. ATP-binding cassette (ABC)
transporters have a well-established utility. "Note that if there is a well-established utility already
associated with the claimed invention, the utility need not be asserted in the specification as filed... Thus
the conclusion reached from this analysis is that a 35 U.S.C. § 101 and a 35 U.S.C. § 112 first
paragraph, utility rejection should not be made." Thus the rejection of the presently claimed invention
under 35 U.S.C. § 101 and 35 U.S.C. § 112, first paragraph, for alleged lack of utility should be overruled.

The Advisory Action also discounts Appellants' assertion regarding the use of the presently
claimed polynucleotides on DNA gene chips, based on the position that such a use would allegedly be
generic. Further, these Actions seem to be requiring Appellants to identify the biological role of the
nucleic acid or function of the protein encoded by the presently claimed polynucleotides before the
present sequences can be used in gene chip applications that meet the requirements of § 101.
Appellants respectfully point out that knowledge of the exact function or role of the presently claimed
sequence is not required to track expression patterns using a DNA chip. As set forth in at least
Appellants' response to the Final Action, given the widespread utility of such "gene chip" methods using
public domain gene sequence information, there can be little doubt that the use of the presently described
novel sequences would have great utility in such DNA chip applications.

Clearly, the claimed sequences provide a specific marker of the gene encoding an ABC
transporter protein and provide a unique identifier of the corresponding gene in the human genome.
Such specific markers are targets for discovering drugs that are associated with human kidney disease,
such as congenital nephrotic syndrome. Thus, those skilled in the art would instantly recognize that the
present nucleotide sequence would be an ideal, novel candidate for assessing gene expression using, for
example, DNA chips, as the specification details at least on page 5, lines 19-22. Such "DNA chips"
clearly have utility, as evidenced by hundreds of issued U.S. Patents, exemplified by U.S. Patent Nos.
5,445,934 (Exhibit F), 5,556,752 (Exhibit G), 5,744,305 (Exhibit H), as well as more recently
issued U.S. Patent Nos. 5,837,832 (Exhibit I), 6,156,501 (Exhibit J) and 6,261,776 (Exhibit K).

The Board is further requested to consider that, given the huge expense of the drug discovery
process, even negative information has great "real world" practical utility. Knowing that a given gene is
not expressed in medically relevant tissue provides an informative finding of great value to industry by
allowing for the more efficient deployment of expensive drug discovery resources. Such practical
considerations are equally applicable to the scientific community in general, in that time and resources
are not wasted chasing what are essentially scientific dead-ends (from the perspective of medical
relevance). Clearly, compositions that enhance the utility of such DNA gene chips, such as the
presently claimed sequences encoding ATP-binding cassette (ABC) transporters, must in themselves
be useful. Moreover, the presently described ABC transporter provides uniquely specific sequence
resources for identifying and quantifying full length transcripts that were encoded by the corresponding
human genomic locus. Accordingly, there can be no question that the described sequences provide an
exquisitely specific utility for analyzing gene expression.

Additionally, only a small percentage of the genome (2-4%) actually encodes exons, which in 
turn encode amino acid sequences. Thus, not all human genomic DNA sequences are useful in such 
gene chip applications. This further discounts the Examiner's position that such uses are "generic". 
The present claims clearly meet the requirements of 35 U.S.C. § 101. It has been clearly established 
that a statement of utility in a specification must be accepted absent reasons why one skilled in the art 
would have reason to doubt the objective truth of such statement. In re Langer, 503 F.2d 1380,
1391, 183 USPQ 288, 297 (CCPA 1974); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367,
370 (CCPA 1971).

Evidence of the "real world" substantial utility of the present invention is further provided by the
fact that there is an entire industry based on the use of gene sequences or fragments thereof in a gene
chip format. Perhaps the most notable gene chip company is Affymetrix. However, there are many
companies which have, at one time or another, concentrated on the use of gene sequences or
fragments, in gene chip and non-gene chip formats, for example: Gene Logic, ABI-Perkin-Elmer,
HySeq and Incyte. In addition, one such company, Rosetta Inpharmatics, was viewed to have such
"real world" value that it was acquired by a large pharmaceutical company, Merck & Co., for a
substantial sum of money (net equity value of the transaction was $620 million). The "real world"
substantial industrial utility of gene sequences or fragments would, therefore, appear to be widespread
and well established. Clearly, persons of skill in the art, as well as venture capitalists and investors,
readily recognize the utility, both scientific and commercial, of genomic data in general, and specifically
human genomic data. Billions of dollars have been invested in the human genome project, resulting in
useful genomic data (see, e.g., Venter et al., 2001, Science 291:1304; Exhibit L). The results have been a
stunning success as the utility of human genomic data has been widely recognized as a great gift to
humanity (see, e.g., Jasny and Kennedy, 2001, Science 291:1153; Exhibit M). Clearly, the usefulness
of human genomic data, such as the presently claimed nucleic acid molecules, is substantial and credible
(worthy of billions of dollars and the creation of numerous companies focused on such information) and
well-established (the utility of human genomic information has been clearly understood for many years).

A still further example of utility is the use of the present sequences in such diagnostic assays
(at least at page 14, line 1) as those associated with identification of paternity and forensic analysis,
among others. The sequences of the present invention have particular utility as the application as filed
identified several polymorphisms (page 13, lines 16-25). This is also not a case of a potential utility.
Appellants respectfully submit that even in the worst case scenario, the described polymorphisms are
each useful to distinguish 50% of the population (in other words, the marker being present in half of the
population) and that the ability of a polymorphic marker to distinguish at least 50% of the population is
an inherent feature of any polymorphic marker, and this feature is well understood by those of skill in
the art. Appellants note that as a matter of law, it is well settled that a patent need not disclose what is
well known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988). Support for
Appellants' assertion of utility is provided by the fact that the skilled artisan would readily recognize and
easily believe that the presently described polymorphic markers could be useful in forensic analysis.
The fact that forensic biologists use polymorphic markers such as those described by Appellants every
day provides more than ample support for the assertion that forensic biologists would also be able to
use the specific polymorphic markers described by Appellants in the same fashion. Therefore, again it is
clear that the sequences of the present invention have utility.



Given the physiologic activity and importance of ABC transporters known to those of skill in
the art, those of skill in the art would readily appreciate the importance of tracking the expression of the
genes encoding the described proteins, particularly due to the well-established role of ABC transporters
in drug resistance in cancer cells. In the present case this apparent utility is further bolstered by the
expression of the sequences of the present invention in the prostate, a tissue which when involved in
cancer often undergoes multiple drug resistance. With regard to the use of the claimed sequences in an
array for screening purposes, Appellants respectfully point out that nucleic acid sequences have the
greatest specific utility in gene chip applications once the role of the sequence has been identified, as
have tissues of interest, as in the present case. Once the role of the particular nucleic acid is known, the
level of gene expression has an even greater significance. By identifying the physiological role of the
claimed sequence, the claimed sequence has a far greater utility in gene chip applications than just any
random piece of DNA. Appellants respectfully submit that specific utility, which is the proper standard
for utility under 35 U.S.C. § 101, is distinct from the requirement for a unique utility, which is clearly an
improper standard. As clearly stated by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw PLC,
20 USPQ2d 1101 (Fed. Cir. 1991; "Carl Zeiss"):

An invention need not be the best or only way to accomplish a certain result, and it
need only be useful to some extent and in certain applications: "[T]he fact that an
invention has only limited utility and is only operable in certain applications is not
grounds for finding a lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ
473, 480 (Fed. Cir. 1984).

Therefore, just because other nucleic acid sequences find utility in gene chip applications does not mean
that the use of Appellants' sequence in gene chip applications is not a specific utility. Furthermore, the
requirement for a unique utility is clearly not the standard adopted by the Patent and Trademark Office.
If every invention were required to have a unique utility, the Patent and Trademark Office would no
longer be issuing patents on batteries, automobile tires, golf balls, golf clubs, and treatments for a variety
of human diseases, such as cancer and bacterial or viral infections, just to name a few particular
examples, because examples of each of these have already been described and patented. All batteries
have the exact same utility - specifically, to provide power. All automobile tires have the exact same
utility - specifically, for use on automobiles. All golf balls and golf clubs have the exact same utility -
specifically, use in the game of golf. All cancer treatments have the exact same utility - specifically, to
treat cancer. All anti-infectious agents have the exact same broader utility - specifically, to treat
infections. However, only the briefest perusal of virtually any issue of the Official Gazette provides
numerous examples of patents being granted on each of the above compositions every week.
Furthermore, if a composition needed to be unique to be patented, the entire class and subclass system
would be an effort in futility, as the class and subclass system serves solely to group such common
inventions, which would not be required if each invention needed to have a unique utility. Thus, the
present sequence clearly meets the requirements of 35 U.S.C. § 101.

Further evidence of utility of the presently claimed polynucleotide, although only one is needed 
to meet the requirements of 35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); 
In re Gottlieb, 140 USPQ 665 (CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); 
Hoffman v. Klaus, 9 USPQ2d 1657 (Bd. Pat. App. & Inter. 1988)), is the specific utility the present 
nucleotide sequence has in determining the genomic structure of the corresponding human chromosome 
(specification at page 14, lines 9-10), for example mapping the protein encoding regions as described
in the specification (page 3, lines 26-29) and evidenced below. Clearly, the present polynucleotide
provides exquisite specificity in localizing the specific region of the human chromosome containing the 
gene encoding the given polynucleotide, a utility not shared by virtually any other nucleic acid sequence. 
In fact, it is this specificity that makes this particular sequence so useful. Early gene mapping techniques 
relied on methods such as Giemsa staining to identify regions of chromosomes. However, such 
techniques produced genetic maps with a resolution of only 5 to 10 megabases, far too low to be of 
much help in identifying specific genes involved in disease. The skilled artisan readily appreciates the 
significant benefit afforded by markers that map a specific locus of the human genome, such as the 
present nucleic acid sequence. 

Only a minor percentage of the genome actually encodes exons, which in turn encode amino 
acid sequences. The presently claimed polynucleotide sequence provides biologically validated 
empirical data (e.g., showing which sequences are transcribed, spliced, and polyadenylated) that 
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specifically defines that portion of the corresponding genomic locus that actually encodes exon 
sequence. Equally significant is that the claimed polynucleotide sequence defines how the encoded 
exons are actually spliced together to produce an active transcript (i.e., the described sequences are 
useful for functionally defining exon splice-junctions). The Appellants respectfully submit that the 
practical scientific value of expressed, spliced, and polyadenylated mRNA sequences is readily 
apparent to those skilled in the relevant biological and biochemical arts. For further evidence 
supporting the Appellants' position, the Board is requested to review, for example, section 3 of Venter
et al. (supra at pp. 1317-1321, including Fig. 11 at pp. 1324-1325), which demonstrates the
significance of expressed sequence information in the structural analysis of genomic data. The presently 
claimed polynucleotide sequence defines a biologically validated sequence that provides a unique and 
specific resource for mapping the genome essentially as described in the Venter et al. article. 

Still further evidence supporting Appellants' assertions of the specific utility of the sequences
of the present invention in localizing the specific region of the human chromosome and identifying
functionally active intron/exon splice junctions is provided in Exhibit N. This exhibit is the
result of a BLAST analysis comparing SEQ ID NO:23 of the present invention to the identified
human genomic sequence. The result indicates that the sequence of the present invention is encoded by
25 exons spread non-contiguously along a region of human chromosome 16 that is contained within the
sequence represented by clone AC0076005. Thus, clearly, one would not simply be able to identify the 25
protein encoding exons that make up the sequence of the present invention from within the large
genomic sequence. Nor would one be able to map the protein encoding regions identified specifically
by the sequences of the present invention without knowing exactly what those specific sequences were.
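
By way of illustration only, the mapping described above can be sketched with toy data. The following is a minimal sketch, not the BLAST analysis of Exhibit N: the sequences, the function name `map_exons`, and the greedy exact-match strategy are all assumptions introduced here to show how a spliced transcript sequence picks out non-contiguous exon blocks within a longer genomic sequence.

```python
def map_exons(cdna, genome, min_len=4):
    """Greedily place consecutive pieces of a spliced transcript on genomic DNA.

    Returns a list of (start, end) half-open genomic coordinates, one per
    exon-like block. Exact matching only; a toy stand-in for a real aligner.
    """
    blocks = []
    c_pos = 0  # next unplaced position in the transcript
    g_pos = 0  # genomic position just past the last placed block
    while c_pos < len(cdna):
        best_len, best_start = 0, -1
        length = min(min_len, len(cdna) - c_pos)
        # Extend the candidate block one base at a time for as long as it
        # still occurs somewhere downstream in the genomic sequence.
        while c_pos + length <= len(cdna):
            start = genome.find(cdna[c_pos:c_pos + length], g_pos)
            if start == -1:
                break
            best_len, best_start = length, start
            length += 1
        if best_len == 0:
            raise ValueError("transcript segment not found in genomic sequence")
        blocks.append((best_start, best_start + best_len))
        c_pos += best_len
        g_pos = best_start + best_len
    return blocks


# Hypothetical toy data: three exons separated by two introns.
EXONS = ["ATGGCC", "CATTACA", "TGCTGA"]
INTRONS = ["GTAAGTCCCCAG", "GTAAGTTTTTAG"]
genome = "TTTT" + EXONS[0] + INTRONS[0] + EXONS[1] + INTRONS[1] + EXONS[2] + "AAAA"
cdna = "".join(EXONS)

exon_blocks = map_exons(cdna, genome)
print(exon_blocks)  # prints [(4, 10), (22, 29), (41, 47)]
```

Without the transcript sequence, nothing in the toy genomic string marks which stretches are exons. A real analysis would use an alignment tool that tolerates mismatches and resolves ambiguous splice boundaries, which this sketch does not attempt.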

Rather, the question of utility is a straightforward one. As set forth by the Federal Circuit, 
"(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of 
providing some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. 
Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal Circuit 
has stated that "(t)o violate § 101 the claimed device must be totally incapable of achieving a useful
result." Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 (Fed. Cir. 1992),
emphasis added. Cross v. Iizuka (224 USPQ 739 (Fed. Cir. 1985); "Cross") states "any utility of the
claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748, emphasis added. Indeed,
the Federal Circuit recently emphatically confirmed that "anything under the sun that is made by man" is 
patentable (State Street Bank & Trust Co. v. Signature Financial Group Inc., 47 USPQ2d 1596, 
1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in Diamond v. Chakrabarty, 206
USPQ 193 (S.Ct. 1980)). 

The legal test for utility simply involves an assessment of whether those skilled in the art would 
find any of the utilities described for the invention to be credible or believable. According to the 
Examination Guidelines for the Utility Requirement, if the applicant has asserted that the claimed 
invention is useful for any particular purpose (i.e., it has a "specific and substantial utility") and the 
assertion would be considered credible by a person of ordinary skill in the art, the Examiner should not 
impose a rejection based on lack of utility (66 Federal Register 1098, January 5, 2001). 

In In re Brana (34 USPQ2d 1436 (Fed. Cir. 1995), "Brana"), the Federal Circuit
admonished the P.T.O. for confusing "the requirements under the law for obtaining a patent with the 
requirements for obtaining government approval to market a particular drug for human consumption". 
Brana at 1442. The Federal Circuit went on to state: 

At issue in this case is an important question of the legal constraints on patent office 
examination practice and policy. The question is, with regard to pharmaceutical 
inventions, what must the applicant provide regarding the practical utility or usefulness 
of the invention for which patent protection is sought. This is not a new issue; it is one 
which we would have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing 
quotation is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first 
paragraph. This is made evident in the continuing text in Brana, which explains the correlation between 
35 U.S.C. §§ 101 and 112, first paragraph. The Federal Circuit concluded: 

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of
pharmaceutical inventions, necessarily includes the expectation of further research and
development. The stage at which an invention in this field becomes useful is well before 
it is ready to be administered to humans. Were we to require Phase II testing in order 
to prove utility, the associated costs would prevent many companies from obtaining 
patent protection on promising new inventions, thereby eliminating an incentive to 
pursue, through research and development, potential cures in many crucial areas such 
as the treatment of cancer. 

Brana at 1442-1443, citations omitted. In assessing the question of whether undue experimentation 
would be required in order to practice the claimed invention, the key term is "undue", not 
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (C.C.P.A. 1976). The need for 
some experimentation does not render the claimed invention unpatentable. Indeed, a considerable 
amount of experimentation may be permissible if such experimentation is routinely practiced in the art. 
In re Angstadt and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd, 18 USPQ2d 
1016 (Fed. Cir. 1991). As a matter of law, it is well settled that a patent need not disclose what is well 
known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988).

Finally, with regard to the issue of due process, while Appellants are well aware of the new
Utility Guidelines set forth by the USPTO, Appellants respectfully point out that the current rules and
regulations regarding the examination of patent applications are and always have been the patent laws as
set forth in 35 U.S.C. and the patent rules as set forth in 37 C.F.R., not the Manual of Patent
Examining Procedure or particular guidelines for patent examination set forth by the USPTO.
Furthermore, it is the job of the judiciary, not the USPTO, to interpret these laws and rules. Appellants 
are unaware of any significant recent changes in either 35 U.S.C. § 101, or in the interpretation of 
35 U.S.C. § 101 by the Supreme Court or the Federal Circuit that are in keeping with the new Utility
Guidelines set forth by the USPTO. This is underscored by numerous patents that have been issued 
over the years that claim nucleic acid fragments that do not comply with the new Utility Guidelines. As 
examples of such issued U.S. Patents, the Board is invited to review U.S. Patent Nos. 5,817,479 
(Exhibit O), 5,654,173 (Exhibit P), and 5,552,281 (Exhibit Q; each of which claims short
polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit R; which includes no 
working examples), none of which contain examples of the "real-world" utilities that the Examiner
seems to be requiring. As issued U.S. Patents are presumed to meet all of the requirements for
patentability, including 35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below),
Appellants submit that the present polynucleotides must also meet the requirements of 
35 U.S.C. § 101. While Appellants agree that each application is examined on its own merits, 
Appellants are unaware of any changes to 35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 
by the Supreme Court or the Federal Circuit, since the issuance of these patents that render the subject
matter claimed in these patents, which is similar to the subject matter in question in the present
application, suddenly non-statutory or failing to meet the requirements of 35 U.S.C. § 101. Given
the rapid pace of development in the biotechnology arts, it is difficult for the Appellants to understand
how an invention fully disclosed and free of prior art at the time the present application was filed could
somehow retain less utility and be less enabled than inventions in the cited issued U.S. patents (which
were filed during a time when the level of skill in the art was clearly lower). Simply put, Appellants'
invention is more enabled and retains at least as much utility as the inventions described in the claims
of the U.S. patents of record. Thus, holding Appellants to a different standard of utility would be 
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1 and 5-7 
under 35 U.S.C. § 101 must be overruled. 

B. Are Claims 1 and 5-7 Unusable Due to a Lack of Patentable Utility? 

The Final Action and Advisory Action maintain the rejection of claims 1 and 5-7 under 35
U.S.C. § 112, first paragraph, since allegedly one skilled in the art would not know how to use the
invention, as the invention allegedly is not supported by either a clear asserted utility or a
well-established utility.

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have
determined that the utility requirement of Section 101 and the how to use requirement of Section 112,
first paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra;
In re Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche,
439 F.2d 1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1 and
5-7 have been shown to have "a specific, substantial, and credible utility", as detailed in Section VIII(A)
above, the present rejection of claims 1 and 5-7 under 35 U.S.C. § 112, first paragraph, cannot stand.

Appellants therefore submit that the rejection of claims 1 and 5-7 under 35 U.S.C. § 112, first
paragraph, must be overruled.

IX. APPENDIX

The claims involved in this appeal are as follows: 

1. An isolated nucleic acid molecule comprising the nucleotide sequence
of SEQ ID NO:23.

5. An isolated nucleic acid molecule comprising a nucleotide sequence that encodes the amino 
acid sequence of SEQ ID NO:24. 

6. An expression vector comprising a nucleic acid sequence encoding the amino acid sequence 
of SEQ ID NO:24.

7. A cell comprising the expression vector of Claim 6.

X. CONCLUSION

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's 
conclusion that claims 1 and 5-7 lack a patentable utility and are unusable by the skilled artisan due to a 
lack of patentable utility is unwarranted. It is therefore requested that the Board overturn the Final
Action's rejections.

Date Lance K. Ishimoto Reg. No. 41,866

Agent For Appellants

Two new genes from the human ATP-binding cassette
transporter superfamily, ABCC11 and ABCC12, tandemly
duplicated on chromosome 16q12.

Tammur J, Prades C, Arnould I, Rzhetsky A, Hutchinson A, Adachi M, 
Schuetz JD, Swoboda KJ, Ptacek LJ, Rosier M, Dean M, Allikmets R. 

Department of Biotechnology, Institute of Molecular and Cell Biology, 
Tartu University, Tartu, Estonia. 

Several years ago, we initiated a long-term project of cloning new human 
ATP-binding cassette (ABC) transporters and linking them to various 
disease phenotypes. As one of the results of this project, we present two new 
members of the human ABCC subfamily, ABCC11 and ABCC12. These 
two new human ABC transporters were fully characterized and mapped to 
the human chromosome 16q12. With the addition of these two genes, the
complete human ABCC subfamily has 12 identified members (ABCC1-12), 
nine from the multidrug resistance-like subgroup, two from the sulfonylurea 
receptor subgroup, and the CFTR gene. Phylogenetic analysis determined 
that ABCC11 and ABCC12 are derived by duplication, and are most closely 
related to the ABCC5 gene. Genetic variation in some ABCC subfamily 
members is associated with human inherited diseases, including cystic 
fibrosis (CFTR/ABCC7), Dubin-Johnson syndrome (ABCC2), 
pseudoxanthoma elasticum (ABCC6) and familial persistent 
hyperinsulinemic hypoglycemia of infancy (ABCC8). Since ABCC11 and 
ABCC12 were mapped to a region harboring gene(s) for paroxysmal
kinesigenic choreoathetosis, the two genes represent positional candidates 
for this disorder.

FASTA searches a protein or DNA sequence data bank
 version 3.3t05 March 30, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

/tmp/fastaCAACZaWfX: 1219 aa
 >SEQ ID NO 23 human transporter
 vs /tmp/fastaDAADZaWfX library
searching /tmp/fastaDAADZaWfX library

 1382 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 40, opt: 28, gap-pen: -12/ -2, width: 16
 Scan time: 0.050
The best scores are:                                          opt
gi|21729873|ref|NP_115972.2| ATP-binding cassette    (1382)  4838

>>gi|21729873|ref|NP_115972.2| ATP-binding cassette, sub (1382 aa)
 initn: 7928 init1: 4838 opt: 4838
Smith-Waterman score: 7606; 88.061% identity in 1382 aa overlap (1-1219:1-1382)

               10        20        30        40        50        60
SEQ    MTRKRTYWVPNSSGGLVNRGIDIGDDMVSGLIYKTYTLQDGPWSQQERNPEAPGRAAVPP
gi|217 MTRKRTYWVPNSSGGLVNRGIDIGDDMVSGLIYKTYTLQDGPWSQQERNPEAPGRAAVPP
               10        20        30        40        50        60

               70        80        90       100       110       120
SEQ    WGKYDAALRTMIPFRPKPRFPAPQPLDNAGLFSYLTVSWLTPLMIQSLRSRLDENTIPPL
gi|217 WGKYDAALRTMIPFRPKPRFPAPQPLDNAGLFSYLTVSWLTPLMIQSLRSRLDENTIPPL
               70        80        90       100       110       120

              130       140       150       160       170       180
SEQ    SVHDASDKNVQRLHRLWEEEVSRRGIEKASVLLVMLRFQRTRLIFDALLGICFCIASVLG
gi|217 SVHDASDKNVQRLHRLWEEEVSRRGIEKASVLLVMLRFQRTRLIFDALLGICFCIASVLG
              130       140       150       160       170       180

              190       200       210       220       230       240
SEQ    PILIIPKILEYSEEQLGNVVHGVGLCFALFLSECVKSLSFSSSWIINQRTAIRFQAAVSS
gi|217 PILIIPKILEYSEEQLGNVVHGVGLCFALFLSECVKSLSFSSSWIINQRTAIRFRAAVSS
              190       200       210       220       230       240

              250       260       270       280       290       300
SEQ    FAFEKLIQFKSVIHITSGEAISFFTGDVNYLFEGVCYGPLVLITCASLVICSISSYFIIG
gi|217 FAFEKLIQFKSVIHITSGEAISFFTGDVNYLFEGVCYGPLVLITCASLVICSISSYFIIG
              250       260       270       280       290       300

              310       320       330       340       350       360
SEQ    YTAFIAILCYLLVFPLEVFMTRMAVKAQHHTSEVSDQRIRVTSEVLTCIKLIKMYTWEKP
gi|217 YTAFIAILCYLLVFPLAVFMTRMAVKAQHHTSEVSDQRIRVTSEVLTCIKLIKMYTWEKP
              310       320       330       340       350       360

              370       380       390       400       410       420
SEQ    FAKIIEDLRRKERKLLEKCGLVQSLTSITLFIIPTVATAVWVLIHTSLKLKLTASMAFSM
gi|217 FAKIIEDLRRKERKLLEKCGLVQSLTSITLFIIPTVATAVWVLIHTSLKLKLTASMAFSM
              370       380       390       400       410       420

              430       440       450       460       470       480
SEQ    LASLNLLRLSVFFVPIAVKGLTNSKSAVMRFKKFFLQESPVFYVQTLQDPSKALVFEEAT
gi|217 LASLNLLRLSVFFVPIAVKGLTNSKSAVMRFKKFFLQESPVFYVQTLQDPSKALVFEEAT
              430       440       450       460       470       480

              490       500       510       520       530       540
SEQ    LSWQQTCPGIVNGALELERNGHASEGMTRPRDALGPEEEGNSLGPELHKINLVVSKGMML
gi|217 LSWQQTCPGIVNGALELERNGHASEGMTRPRDALGPEEEGNSLGPELHKINLVVSKGMML
              490       500       510       520       530       540

              550       560       570       580       590       600
SEQ    GVCGNTGSGKSSLLSAILEEMHLLEGSVGVQGSLAYVPQQAWIVSGNIRENILMGGAYDK
gi|217 GVCGNTGSGKSSLLSAILEEMHLLEGSVGVQGSLAYVPQQAWIVSGNIRENILMGGAYDK
              550       560       570       580       590       600

              610       620       630       640       650       660
SEQ    ARYLQVLHCCSLNRDLELLPFGDMTEIGERGLNLSGGQKQRISLARAVYSDRQIYLLDDP
gi|217 ARYLQVLHCCSLNRDLELLPFGDMTEIGERGLNLSGGQKQRISLARAVYSDRQIYLLDDP
              610       620       630       640       650       660

              670       680       690       700       710       720
SEQ    LSAVDAHVGKHIFEECIKKTLRGKTVVLVTHQLQYLEFCGQIILLENGKICENGTHSELM
gi|217 LSAVDAHVGKHIFEECIKKTLRGKTVVLVTHQLQYLEFCGQIILLENGKICENGTHSELM
              670       680       690       700       710       720

              730
SEQ    QKKGKYAQLIQKMHKEATS-----------------------------------------
gi|217 QKKGKYAQLIQKMHKEATSDMLQDTAKIAEKPKVESQALATSLEESLNGNAVPEHQLTQE
              730       740       750       760       770       780

SEQ    ------------------------------------------------------------
gi|217 EEMEEGSLSWRVYHHYIQAAGGYMVSCIIFFFVVLIVFLTIFSFWWLSYWLEQGSGTNSS
              790       800       810       820       830       840

SEQ    ------------------------------------------------------------
gi|217 RESNGTMADLGNIADNPQLSFYQLVYGLNALLLICVGVCSSGIFTKVTRKASTALHNKLF
              850       860       870       880       890       900

       740       750       760       770       780       790
SEQ    --VFRCPMSFFDTIPIGRLLNCFAGDLEQLDQLLPIFSEQFLVLSLMVIAVLLIVSVLSP
gi|217 NKVFRCPMSFFDTIPIGRLLNCFAGDLEQLDQLLPIFSEQFLVLSLMVIAVLLIVSVLSP
              910       920       930       940       950       960

       800       810       820       830       840       850
SEQ    YILLMGAIIMVICFIYYMMFKKAIGVFKRLENYSRSPLFSHILNSLQGLSSIHVYGKTED
gi|217 YILLMGAIIMVICFIYYMMFKKAIGVFKRLENYSRSPLFSHILNSLQGLSSIHVYGKTED
              970       980       990      1000      1010      1020

       860       870       880       890       900       910
SEQ    FISQFKRLTDAQNNYLLLFLSSTRWMALRLEIMTNLVTLAVALFVAFGISSTPYSFKVMA
gi|217 FISQFKRLTDAQNNYLLLFLSSTRWMALRLEIMTNLVTLAVALFVAFGISSTPYSFKVMA
             1030      1040      1050      1060      1070      1080

       920       930       940       950       960       970
SEQ    VNIVLQLASSFQATARIGLETEAQFTAVERILQYMKMCVSEAPLHMEGTSCPQGWPQHGE
gi|217 VNIVLQLASSFQATARIGLETEAQFTAVERILQYMKMCVSEAPLHMEGTSCPQGWPQHGE
             1090      1100      1110      1120      1130      1140

       980       990      1000      1010      1020      1030
SEQ    IIFQDYHMKYRDNTPTVLHGINLTIRGHEVVGIVGRTGSGKSSLGMALFRLVEPMAGRIL
gi|217 IIFQDYHMKYRDNTPTVLHGINLTIRGHEVVGIVGRTGSGKSSLGMALFRLVEPMAGRIL
             1150      1160      1170      1180      1190      1200

      1040      1050      1060      1070      1080      1090
SEQ    IDGVDICSIGLEDLRSKLSVIPQDPVLLSGTIRFNLDPFDRHTDQQIWDALERTFLTKAI
gi|217 IDGVDICSIGLEDLRSKLSVIPQDPVLLSGTIRFNLDPFDRHTDQQIWDALERTFLTKAI
             1210      1220      1230      1240      1250      1260

      1100      1110      1120      1130      1140      1150
SEQ    SKFPKKLHTDVVENGGNFSVGERQLLCIARAVLRNSKIILIDEATASIDMETDTLIQRTI
gi|217 SKFPKKLHTDVVENGGNFSVGERQLLCIARAVLRNSKIILIDEATASIDMETDTLIQRTI
             1270      1280      1290      1300      1310      1320

      1160      1170      1180      1190      1200      1210
SEQ    REAFQGCTVLVIAHRVTTVLNCDHILVMGNGKVVEFDRPEVLRKKPGSLFAALMATATSS
gi|217 REAFQGCTVLVIAHRVTTVLNCDHILVMGNGKVVEFDRPEVLRKKPGSLFAALMATATSS
             1330      1340      1350      1360      1370      1380

SEQ    LR
gi|217 LR

1219 residues in 1 query sequences
1382 residues in 1 library sequences
 Scomplib [version 3.3t05 March 30, 2000]
 start: Mon Nov 11 10:23:05 2002 done: Mon Nov 11 10:23:06 2002
 Scan time: 0.050 Display time: 2.400
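
As a check on the arithmetic behind the reported figure, the following minimal sketch relates the 88.061% identity to the 1382 aa overlap. It assumes that the percentage is the number of identical aligned positions divided by the overlap length, gaps included; that is an assumption about how the report was computed, although it is consistent with the numbers shown. The function names are hypothetical.

```python
def percent_identity(identical, overlap):
    """Identity as a percentage of the alignment overlap (gaps included)."""
    return 100.0 * identical / overlap


def identical_positions(reported_percent, overlap):
    """Back out the whole number of identical positions from a reported percentage."""
    return round(reported_percent / 100.0 * overlap)


def alignment_identity(row_a, row_b, gap="-"):
    """Percent identity of two equal-length aligned rows; gap columns never match."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


# Figures taken from the FASTA report above.
n_identical = identical_positions(88.061, 1382)
print(n_identical)                                     # prints 1217
print(round(percent_identity(n_identical, 1382), 3))   # prints 88.061

# Hypothetical eight-column alignment with one gap column.
print(alignment_identity("MTRK-RTY", "MTRKQRTY"))      # prints 87.5
```

Under that assumption, 1217 identical positions together with the 163-residue length difference between the two sequences (1382 minus 1219) would leave two mismatched positions in the overlap.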
1: NP_115972. ATP-binding casse...[gi:21729873]

LOCUS       ABCC11       1382 aa       linear   PRI 05-NOV-2002
DEFINITION  ATP-binding cassette, sub-family C, member 11 isoform a;
            multi-resistance protein 8; ATP-binding cassette transporter MRP8;
            ATP-binding cassette protein C11 [Homo sapiens].
ACCESSION   NP_115972
VERSION     NP_115972.2  GI:21729873
DBSOURCE    REFSEQ: accession NM_032583.2
KEYWORDS
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 1382)
  AUTHORS   Dean,M., Rzhetsky,A. and Allikmets,R.
  TITLE     The human ATP-binding cassette (ABC) transporter superfamily
  JOURNAL   Genome Res. 11 (7), 1156-1166 (2001)
  MEDLINE   21329047
   PUBMED   11435397
REFERENCE   2  (residues 1 to 1382)
  AUTHORS   Tammur,J., Prades,C., Arnould,I., Rzhetsky,A., Hutchinson,A.,
            Adachi,M., Schuetz,J.D., Swoboda,K.J., Ptacek,L.J., Rosier,M.,
            Dean,M. and Allikmets,R.
  TITLE     Two new genes from the human ATP-binding cassette transporter
            superfamily, ABCC11 and ABCC12, tandemly duplicated on chromosome
            16q12
  JOURNAL   Gene 273 (1), 89-96 (2001)
  MEDLINE   21376129
   PUBMED   11483364
REFERENCE   3  (residues 1 to 1382)
  AUTHORS   Bera,T.K., Lee,S., Salvatore,G., Lee,B. and Pastan,I.
  TITLE     MRP8, a new member of ABC transporter superfamily, identified by
            EST database mining and gene prediction program, is highly
            expressed in breast cancer
  JOURNAL   Mol. Med. 7 (8), 509-516 (2001)
  MEDLINE   21475973
   PUBMED   11591886
REFERENCE   4  (residues 1 to 1382)
  AUTHORS   Yabuuchi,H., Shimizu,H., Takayanagi,S. and Ishikawa,T.
  TITLE     Multiple splicing variants of two new human ATP-binding cassette
            transporters, ABCC11 and ABCC12
  JOURNAL   Biochem. Biophys. Res. Commun. 288 (4), 933-939 (2001)
  MEDLINE   21547789
   PUBMED   11688999
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AF367202.1.
            On Jul 11, 2002 this sequence version replaced gi:14211905.
            Summary: The protein encoded by this gene is a member of the
            superfamily of ATP-binding cassette (ABC) transporters. ABC
            proteins transport various molecules across extra- and
            intra-cellular membranes. ABC genes are divided into seven distinct
            subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This ABC
            full transporter is a member of the MRP subfamily which is involved
            in multi-drug resistance. It is expressed at low levels in all
            tissues, except kidney, spleen, and colon. This gene and family
            member ABCC12 are determined to be derived by duplication and are
            both localized to chromosome 16q12.1. Their chromosomal
            localization, potential function, and expression patterns identify
            them as candidates for paroxysmal kinesigenic choreoathetosis, a
            disorder characterized by attacks of involuntary movements and
            postures, chorea, and dystonia. Multiple alternatively spliced
            transcript variants have been described for this gene.
            Transcript Variant: This variant (1), as well as variant 2, encodes
            the predominant isoform (a).
FEATURES             Location/Qualifiers
     source          1..1382
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="16"
                     /map="16q12.1"
     Protein         1..1382
                     /product="ATP-binding cassette, sub-family C, member 11
                     isoform a"
                     /note="multi-resistance protein 8; ATP-binding cassette
                     transporter MRP8; ATP-binding cassette protein C11"
     Region          163..427
                     /region_name="ABC transporter transmembrane region. This
                     family represents a unit of six transmembrane helices.
                     Many members of the ABC transporter family (pfam00005)
                     have two such regions."
                     /note="ABC_membrane"
                     /db_xref="CDD:pfam00664"
     Region          536..691
                     /region_name="ATPases associated with a variety of
                     cellular activities"
                     /note="AAA"
                     /db_xref="CDD:smart00382"
     Region          537..708
                     /region_name="ABC transporter. ABC transporters for a
                     large family of proteins responsible for translocation of
                     a variety of compounds across biological membranes. ABC
                     transporters are the largest family of proteins in many
                     completely sequenced bacteria. ABC transporters are
                     composed of two copies of this domain and two copies of a
                     transmembrane domain pfam00664. These four domains may
                     belong to a single polypeptide or belong in different
                     polypeptide chains"
                     /note="ABC_tran"
                     /db_xref="CDD:pfam00005"
     Region          849..1094
                     /region_name="ABC transporter transmembrane region. This
                     family represents a unit of six transmembrane helices.
                     Many members of the ABC transporter family (pfam00005)
                     have two such regions"
                     /note="ABC_membrane"
                     /db_xref="CDD:pfam00664"
     Region          1168..1360
                     /region_name="ATPases associated with a variety of
                     cellular activities"
                     /note="AAA"
                     /db_xref="CDD:smart00382"
     Region          1169..1351
                     /region_name="ABC transporter. ABC transporters for a
                     large family of proteins responsible for translocation of
                     a variety of compounds across biological membranes. ABC
                     transporters are the largest family of proteins in many
                     completely sequenced bacteria. ABC transporters are
                     composed of two copies of this domain and two copies of a
                     transmembrane domain pfam00664. These four domains may
                     belong to a single polypeptide or belong in different
                     polypeptide chains"
                     /note="ABC_tran"
                     /db_xref="CDD:pfam00005"
     CDS             1..1382
                     /gene="ABCC11"
                     /coded_by="NM_032583.2:79..4227"
                     /note="transporter"
                     /db_xref="LocusID:85320"
                     /db_xref="MIM:607040"
ORIGIN
        1 mtrkrtywvp nssgglvnrg idigddmvsg liyktytlqd gpwsqqernp eapgraavpp
       61 wgkydaalrt mipfrpkprf papqpldnag lfsyltvswl tplmiqslrs rldentippl
      121 svhdasdknv qrlhrlweee vsrrgiekas vllvmlrfqr trlifdallg icfciasvlg
      181 piliipkile yseeqlgnvv hgvglcfalf lsecvkslsf ssswiinqrt airfraavss
      241 fafekliqfk svihitsgea isfftgdvny lfegvcygpl vlitcaslvi csissyfiig
      301 ytafiailcy llvfplavfm trmavkaqhh tsevsdqrir vtsevltcik likmytwekp
      361 fakiiedlrr kerkllekcg lvqsltsitl fiiptvatav wvlihtslkl kltasmafsm
      421 laslnllrls vffvpiavkg ltnsksavmr fkkfflqesp vfyvqtlqdp skalvfeeat
      481 lswqqtcpgi vngalelern ghasegmtrp rdalgpeeeg nslgpelhki nlvvskgmml
      541 gvcgntgsgk ssllsailee mhllegsvgv qgslayvpqq awivsgnire nilmggaydk
      601 arylqvlhcc slnrdlellp fgdmteiger glnlsggqkq rislaravys drqiyllddp
      661 lsavdahvgk hifeecikkt lrgktvvlvt hqlqylefcg qiillengki cengthselm
      721 qkkgkyaqli qkmhkeatsd mlqdtakiae kpkvesqala tsleeslngn avpehqltqe
      781 eemeegslsw rvyhhyiqaa ggymvsciif ffvvlivflt ifsfwwlsyw leqgsgtnss
      841 resngtmadl gniadnpqls fyqlvyglna lllicvgvcs sgiftkvtrk astalhnklf
      901 nkvfrcpmsf fdtipigrll ncfagdleql dqllpifseq flvlslmvia vllivsvlsp
      961 yillmgaiim vicfiyymmf kkaigvfkrl enysrsplfs hilnslqgls sihvygkted
     1021 fisqfkrltd aqnnylllfl sstrwmalrl eimtnlvtla valfvafgis stpysfkvma
     1081 vnivlqlass fqatarigle teaqftaver ilqymkmcvs eaplhmegts cpqgwpqhge
     1141 iifqdyhmky rdntptvlhg inltirghev vgivgrtgsg ksslgmalfr lvepmagril
     1201 idgvdicsig ledlrsklsv ipqdpvllsg tirfnldpfd rhtdqqiwda lertfltkai
     1261 skfpkklhtd vvenggnfsv gerqllciar avlrnskiil ideatasidm etdtliqrti
     1321 reafqgctvl viahrvttvl ncdhilvmgn gkvvefdrpe vlrkkpgslf aalmatatss
     1381 lr

Impaired 2',3'-dideoxy-3'-thiacytidine accumulation in
T-lymphoblastoid cells as a mechanism of acquired resistance
independent of multidrug resistant protein 4 with a possible
role for ATP-binding cassette C11.

Turriziani O, Schuetz JD, Focher F, Scagnolari C, Sampath J, Adachi
M, Bambacioni F, Riva E, Antonelli G.

Department of Experimental Medicine and Pathology, University "La
Sapienza", 00185 Rome, Italy.

Cellular factors may contribute to the decreased efficacy of chemotherapy in 
HIV infection. Indeed, prolonged treatment with nucleoside analogues, such 
as azidothymidine (AZT), 2',3'-deoxycytidine or 9-(2-
phosphonylmethoxyethyl)adenine, induces cellular resistance. We have
developed a human T lymphoblastoid cell line (CEM 3TC) that is
selectively resistant to the antiproliferative effect of 2',3'-dideoxy-3'-
thiacytidine (3TC) because the CEM 3TC cells were equally sensitive to
AZT, as well as the antimitotic agent, vinblastine. The anti-retroviral
activity of 3TC against HIV-1 was also severely impaired in the CEM 3TC
cells. Despite similar deoxycytidine kinase activity and unchanged uptake of
nucleosides such as AZT and 2'-deoxycytidine, CEM 3TC had profoundly
impaired 3TC accumulation. Further studies indicated that CEM 3TC
retained much less 3TC. However, despite a small overexpression of
multidrug resistance protein (MRP) 4, additional studies with cells
specifically engineered to overexpress MRP4 demonstrated there was no
impact on either 3TC accumulation or efflux. Finally, an increased
expression of the MRP5 homologue, ATP-binding cassette C11 (ABCC11)
was observed in the CEM 3TC cells. We speculate that the decreased 3TC 
accumulation in the CEM 3TC might be due to the upregulation of 
ABCC11. 
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NP__660187 
HEFSEQ: accession m 145186.1 



REFERENCE 
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TITLE 
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Dean,M. and Allikmets,R. 

Two new genes from the human ATP-binding cassette transporter 
superfamily, ABCC11 and ABCC12, tandemly duplicated on chromosome 
16ql2 

Gene 273 (1), 89-96 (2001) 

11483364 

5 (residues 1 to 1344) 
Dean,M. , Rzhetsky,A. and Allikmets,R 
The human ATP-binding cassette (ABC) 
Genome Res. 11 (7), 1156-1166 (2001) 

11435397 . 

REVIEWED REFSEQ : This record has been curated by NCBI staff. The 
reference sequence was derived from AF411579. 1 . 



transporter superfamily 



FEATURES 

source 



Protein 



Region 



Region 



Region 



Region 



Summary: The protein encoded by this gene is a member of the 
superfamily of ATP-binding cassette (ABC) transporters. ABC 
proteins transport various molecules across extra- and 
intra-cellular membranes. ABC genes are divided into seven distinct 
subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This ABC 
full transporter is a member of the MRP subfamily which is involved 
in multi-drug resistance. It is expressed at low levels in all 
tissues, except kidney, spleen, and colon. This gene and family 
member ABCC12 are determined to be derived by duplication and are 
both localized to chromosome 16ql2.1. Their chromosomal 
localization, potential function, and expression patterns identify 
them as candidates for paroxysmal kinesigenic choreoathetosis , a 
disorder characterized by attacks of involuntary movements and 
postures, chorea, and dystonia. Multiple alternatively spliced 
transcript variants have been described for this gene. 

Transcript Variant: This variant (3) lacks an alternate in-frame 
exon compared to variant 1, resulting in a shorter protein (isoform 
b) , compared to isoform a. 

Location/Qualifiers 

1. .1344 

/organism="Homo sapiens" 
/ db_xr e f = " t axon : 9606 " 
/ chromosome= " 16 " 
/map="16ql2.1" 
1..1344 

/product= "ATP-binding cassette, sub-family C, member 11 
isoform b" 

/note="multi -resistance protein 8; ATP-binding cassette 
transporter MRP 8 ; ATP-binding cassette protein Cll" 
/calculated__mol_wt=14 9963 
150.. 732 

/region__name= "ABC- type multidrug transport system, ATPase 
and permease components [Defense mechanisms] " 
/note="MdlB" 
/ db_xr e f = " CDD : COG1132 " 
163. .>431 

/region_name="ABC transporter transmembrane region" 
/note= " ABC__membrane " 
/ db_xr e f = « CDD : 40747 " 
517. .>709 

/region_name= "Domain 1 of the ABC subfamily C" 
/note="ABC_subf amilyC_domainl " 
/ db_xr e f = " CDD : 48283 " 
849..>1094 

/region_name="ABC transporter transmembrane region" 
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/note= " ABC_membrane " 

/db_xref = "CDD : 40747 " 
Region 853 . . 1338 

/region_name= "ABC -type multidrug transport system, ATPase 

and permease components [Defense mechanisms] " 

/note="MdlB n 

/db_xref = "CDD : COG1132 " 
Region 1141..>1314 

/region_name= "Domain 1 of the ABC subfamily C" 

/note="ABC_subf amilyC_domainl" 

/ db_xr e f = " CDD : 48283 " 
CDS 1. . 1344 

/gene="ABCCll" 

/coded_by="NM_145186 . 1:79. .4113" 
/db xref="GeneID: 85320 " 
/db xref="MIM: 607040 " 

ORIGIN 

1 mtrkrtywvp nssgglvnrg idigddmvsg liyktytlqd gpwsqqernp eapgraavpp 
61 wgkydaalrt mipfrpkprf papqpldnag lfsyltvswl tplmiqslrs rldentippl 
121 svhdasdknv qrlhrlweee vsrrgiekas vllvmlrfqr trlifdallg icfciasvlg 
181 piliipkile yseeqlgnw hgvglcfalf lsecvkslsf ssswiinqrt airfraavss
241 fafekliqfk svihitsgea isfftgdvny lfegvcygpl vlitcaslvi csissyfiig

301 ytafiailcy llvfplavfm trmavkaqhh tsevsdqrir vtsevltcik likmytwekp
361 fakiiedlrr kerkllekcg lvqsltsitl fiiptvatav wvlihtslkl kltasmafsm 
421 laslnllrls vffvpiavkg ltnsksavmr fkkfflqesp vfyvqtlqdp skalvfeeat 

481 lswqqtcpgi vngalelern ghasegmtrp rdalgpeeeg nslgpelhki nlwskgmml
541 gvcgntgsgk ssllsailee mhllegsvgv qgslayvpqq awivsgnire nilmggaydk 
601 arylqvlhcc slnrdlellp fgdmteiger glnlsggqkq rislaravys drqiyllddp 
661 lsavdahvgk hifeecikkt lrgktwlvt hqlqylefcg qiillengki cengthselm 
721 qkkgkyaqli qkmhkeatsd mlqdtakiae kpkvesqala tsleeslngn avpehqltqe 
781 eemeegslsw rvyhhyiqaa ggymvsciif ffwlivflt ifsfwwlsyw leqgsgtnss 
841 resngtmadl gniadnpqls fyqlvyglna lllicvgvcs sgiftkvtrk astalhnklf 
901 nkvfrcpmsf fdtipigrll ncfagdleql dqllpifseq flvlslmvia vllivsvlsp 
961 yillmgaiim vicfiyymmf kkaigvfkrl enysrsplfs hilnslqgls sihvygkted 

1021 fisqfkrltd aqnnylllfl sstrwmalrl eimtnlvtla valfvafgis stpysfkvma 
1081 vnivlqlass fqatarigle teaqftaver ilqymkmcvs eaplhmegts cpqgwpqhge 
1141 iifqdyhmky rdntptvlhg inltirghev vgivgrtgsg ksslgmalfr lvepmagril 
1201 idgvdicsig ledlrsklsv ipqdpvllsg tirfnldpfd rhtdqqiwda lertfltkai 
1261 ilideatasi dmetdtliqr tireafqgct vlviahrvtt vlncdhilvm gngkwefdr
1321 pevlrkkpgs lfaalmatat sslr
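
Because the ORIGIN block above comes from a scanned page, a simple row-length check can help a reader spot rows that lost residues in scanning. The following is a minimal sketch only: the function names `parse_origin_rows` and `suspicious_rows` are hypothetical, and it assumes the usual GenBank flat-file layout of 60 residues per row.

```python
import re

def parse_origin_rows(block):
    """Return (offset, residues) pairs from a GenBank-style ORIGIN block."""
    rows = []
    for line in block.splitlines():
        match = re.match(r"\s*(\d+)\s+(\S.*)$", line)
        if match:
            # Remove the spaces that separate the 10-residue groups.
            residues = re.sub(r"\s+", "", match.group(2))
            rows.append((int(match.group(1)), residues))
    return rows

def suspicious_rows(rows, width=60):
    """Offsets of rows, other than the last, that do not hold `width` residues."""
    return [offset for offset, residues in rows[:-1] if len(residues) != width]

# Three rows copied from the record above (not the full sequence).
sample = """\
1 mtrkrtywvp nssgglvnrg idigddmvsg liyktytlqd gpwsqqernp eapgraavpp
181 piliipkile yseeqlgnw hgvglcfalf lsecvkslsf ssswiinqrt airfraavss
1321 pevlrkkpgs lfaalmatat sslr
"""
rows = parse_origin_rows(sample)
flagged = suspicious_rows(rows)
print(flagged)
```

In this sample the row at offset 181 is flagged because it holds 59 residues rather than 60, which would be consistent with two letters having been merged during scanning. The final row holds 24 residues, matching the stated length of 1344.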
Comparison of the Amino Acid Sequences of SEQ ID NO: 24 and NP_660187

FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

/tmp/fastaGAABJayWj: 1219 aa
>seqid24
vs /tmp/fastaHAACJayWj library 1
searching /tmp/fastaHAACJayWj library

1344 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup
join: 40, opt: 28, gap-pen: -12/ -2, width: 16
Scan time: 0.034
The best scores are: opt
NP_660187 ACCESSION:NP_660187 NID: gi 21729876 re (1344) 4838

>>NP_660187 ACCESSION:NP_660187 NID: gi 21729876 ref NP_ (1344 aa)
initn: 7156 init1: 4838 opt: 4838
Smith-Waterman score: 7274; 85.311% identity in 1382 aa overlap (1-1219:1-1344)

10 20 30 40 50 60 

seqid2 MTRKRTYWPNSSGGLVNRGIDIGDDWSGLIYKTYTLQDGPWSQQERNPEAPGRAAVPP

NP_660 MTRKRTYWVPNSSGGLVNRGIDIGDDMVSGLIYKTYTLQDGPWSQQERNPEAPGRAAVPP
10 20 30 40 50 60 

70 80 90 100 110 120 

seqid2 WGKYDAALRTMIPFRPKPRFPAPQPLDNAGLFSYLTVSWLTPLMIQSLRSRLDENTIPPL 

NP_660 WGKYDAALRTMIPFRPKPRFPAPQPLDNAGLFSYLTVSWLTPLMIQSLRSRLDENTIPPL
70 80 90 100 110 120 

130 140 150 160 170 180 

seqid2 SWDASDKNVQRLHRLWEEEVSRRGIEKASVLLV^ 

NP_660 SVHDASDKNVQRLHRLWEEEVSRRGIEKASVLLVMLRFQRTRLIFDALLGICFCIASVLG
130 140 150 160 170 180

190 200 210 220 230 240

seqid2 PILIIPKILEYSEEQLGNWHGVGLCFALFLSECVKSLSFSSSWIINQRTAIRFQAAVSS

NP_660 PILIIPKILEYSEEQLGNWHGVGLCFALFLSECVKSLSFSSSWIINQRTAIRFRAAVSS
190 200 210 220 230 240 

250 260 270 280 290 300 

seqid2 FAFEKLIQFKSVIHITSGEAISFFTGDVNYLFEGVCYGPLVLITCASLVICSISSYFIIG 



NP_660 FAFEKLIQFKSVIHITSGEAISFFTGDVNYLFEGVCYGPLVLITCASLVICSISSYFIIG 
250 260 270 280 290 300 






310 320 330 340 350 360 

seqid2 YTAFIAILCYLLVFPLEVFMTRMAVKAQHHTSEVSDQRIRVTSEVLTCIKLIKMYTWEKP

NP_660 YTAFIAILCYLLVFPLAVFMTRMAVKAQHHTSEVSDQRIRVTSEVLTCIKLIKMYTWEKP
310 320 330 340 350 360 

370 380 390 400 410 420 

seqid2 FAKIIEDLRRKERKLLEKCGLVQSLTSITLFIIPTVATAVWVLIHTSLKLKLTASMAFSM 

NP_660 FAKIIEDLRRKERKLLEKCGLVQSLTSITLFIIPTVATAVWVLIHTSLKLKLTASMAFSM
370 380 390 400 410 420

430 440 450 460 470 480

seqid2 LASLNLLRLSVFFVPIAVKGLTNSKSAVMRFKKFFLQESPVFYVQTLQDPSKALVFEEAT

NP_660 LASLNLLRLSVFFVPIAVKGLTNSKSAVMRFKKFFLQESPVFYVQTLQDPSKALVFEEAT
430 440 450 460 470 480 

490 500 510 520 530 540 

seqid2 LSWQQTCPGIVNGALELERNGHASEGMTRPRDALGPEEEGNSLGPELHKINLWSKGMML 

NP_660 LSWQQTCPGIVNGALELERNGHASEGMTRPRDALGPEEEGNSLGPELHKINLVVSKGMML
490 500 510 520 530 540

550 560 570 580 590 600

seqid2 GVCGNTGSGKSSLLSAILEEMHLLEGSVGVQGSLAYVPQQAWIVSGNIRENILMGGAYDK

NP_660 GVCGNTGSGKSSLLSAILEEMHLLEGSVGVQGSLAYVPQQAWIVSGNIRENILMGGAYDK
550 560 570 580 590 600 

610 620 630 640 650 660 

seqid2 ARYLQVLHCCSLNRDLELLPFGDMTEIGERGLNLSGGQKQRISLARAVYSDRQIYLLDDP

NP_660 ARYLQVLHCCSLNRDLELLPFGDMTEIGERGLNLSGGQKQRISLARAVYSDRQIYLLDDP
610 620 630 640 650 660 

670 680 690 700 710 720 

seqid2 LSAVDAHVGKHIFEECIKKTLRGKTWLVTHQLQYLEFCGQIILLENGKICENGTHSELM

NP_660 LSAVDAHVGKHIFEECIKKTLRGKTVVLVTHQLQYLEFCGQIILLENGKICENGTHSELM
670 680 690 700 710 720

730 

seqid2 QKKGKYAQLIQKMHKEATS

NP_660 QKKGKYAQLIQKMHKEATSDMLQDTAKIAEKPKVESQALATSLEESLNGNAVPEHQLTQE
730 740 750 760 770 780 



seqid2 

NP_660 EEMEEGSLSWRVYHHYIQAAGGYMVSCIIFFFVVLIVFLTIFSFWWLSYWLEQGSGTNSS 
790 800 810 820 830 840 



seqid2

NP_660 RESNGTMADLGNIADNPQLSFYQLVYGLNALLLICVGVCSSGIFTKVTRKASTALHNKLF
850 860 870 880 890 900 



740 750 760 770 780 790 

seqid2 --VFRCPMSFFDTIPIGRLLNCFAGDLEQLDQLLPIFSEQFLVLSLMVIAVLLIVSVLSP

NP_660 NKVFRCPMSFFDTIPIGRLLNCFAGDLEQLDQLLPIFSEQFLVLSLMVIAVLLIVSVLSP
910 920 930 940 950 960 

800 810 820 830 840 850 

seqid2 YILLMGAIIMVICFIYYMMFKKAIGVFKRLENYSRSPLFSHILNSLQGLSSIHVYGKTED

NP_660 YILLMGAIIMVICFIYYMMFKKAIGVFKRLENYSRSPLFSHILNSLQGLSSIHVYGKTED
970 980 990 1000 1010 1020

860 870 880 890 900 910

seqid2 FISQFKRLTDAQNNYLLLFLSSTRWMALRLEIMTNLVTLAVALFVAFGISSTPYSFKVMA

NP_660 FISQFKRLTDAQNNYLLLFL

1030 1040 1050 1060 1070 1080 

920 930 940 950 960 970 

seqid2 VNIVLQLASSFQATARIGLETEAQFTAVERILQYMKMCVSEAPLHMEGTSCPQGWPQHGE 

NP_660 VNIVLQLASSFQATARIGLETEAQFTAVERILQYMKMCVSEAPLHMEGTSCPQGWPQHGE
1090 1100 1110 1120 1130 1140 

980 990 1000 1010 1020 1030 

seqid2 IIFQDYHMKYRDNTPTVLHGINLTIRGHEWGIVGRTGSGKSSLGMALFRLVEPMAGRIL 

NP_660 IIFQDYHMKYRDNTPTVLHGINLTIRGHEWGIVGRTGSGKSSLGMALFRLVEPMAGRIL
1150 1160 1170 1180 1190 1200 

1040 1050 1060 1070 1080 1090

seqid2 IDGVDICSIGLEDLRSKLSVIPQDPVLLSGTIRFNLDPFDRHTDQQIWDALERTFLTKAI 

NP_660 IDGVDICSIGLEDLRSKLSVIPQDPVLLSGTIRFNLDPFDRHTDQQIWDALERTFLTKAI 
1210 1220 1230 1240 1250 1260 

1100 1110 1120 1130 1140 1150 

seqid2 SKFPKKLHTDWENGGNFSVGERQLLCIARAVLRNSKIILIDEATASIDMETDTLIQRTI 

NP_660 ILIDEATASIDMETDTLIQRTI

1270 1280

1160 1170 1180 1190 1200 1210

seqid2 REAFQGCTVLVIAHRVTTVLNCDHIL^

NP_660 REAFQGCTVLVIAHRVTTVLNCDHIL^

1290 1300 1310 1320 1330 1340 



seqid2 LR 
NP_660 LR 
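
The summary line of the FASTA output above reports 85.311% identity over a 1382 aa overlap between a 1219 aa query and a 1344 aa library sequence. The short sketch below works through the arithmetic those figures imply. It assumes that the overlap spans both sequences end to end and that identity is computed over all alignment columns, gaps included; the function names are hypothetical and this is not the FASTA program's own code.

```python
def identical_positions(percent_identity, overlap_length):
    """Number of identical columns implied by a reported percent identity."""
    return round(percent_identity / 100.0 * overlap_length)

def percent_identity(row_a, row_b, gap="-"):
    """Percent identity over every column of two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)

# Figures taken from the FASTA summary above.
overlap, query_len, subject_len = 1382, 1219, 1344
identical = identical_positions(85.311, overlap)
gap_columns_in_query = overlap - query_len      # columns where seqid24 has a gap
gap_columns_in_subject = overlap - subject_len  # columns where NP_660187 has a gap

print(identical, gap_columns_in_query, gap_columns_in_subject)
```

Under these assumptions the reported percentage corresponds to 1179 identical columns, with 163 gap columns on the query side and 38 on the library side, so most of the non-identical columns would be gaps rather than substitutions.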
Abstract

Several years ago, we initiated a long-term project of cloning new human ATP-binding cassette (ABC) transporters and linking them to
various disease phenotypes. As one of the results of this project, we present two new members of the human ABCC subfamily, ABCC11 and
ABCC12. These two new human ABC transporters were fully characterized and mapped to the human chromosome 16q12. With the addition
of these two genes, the complete human ABCC subfamily has 12 identified members (ABCC1-12), nine from the multidrug resistance-like
subgroup, two from the sulfonylurea receptor subgroup, and the CFTR gene. Phylogenetic analysis determined that ABCC11 and ABCC12
are derived by duplication, and are most closely related to the ABCC5 gene. Genetic variation in some ABCC subfamily members is
associated with human inherited diseases, including cystic fibrosis (CFTR/ABCC7), Dubin-Johnson syndrome (ABCC2), pseudoxanthoma
elasticum (ABCC6) and familial persistent hyperinsulinemic hypoglycemia of infancy (ABCC8). Since ABCC11 and ABCC12 were mapped
to a region harboring gene(s) for paroxysmal kinesigenic choreoathetosis, the two genes represent positional candidates for this disorder.
© 2001 Elsevier Science B.V. All rights reserved. 

Keywords: ATP-binding cassette transporters; Mapping; Paroxysmal kinesigenic choreoathetosis 
1. Introduction

The ATP-binding cassette (ABC) transporter superfamily 
is one of the largest gene families and encodes a function- 
ally diverse group of membrane proteins involved in 
energy-dependent transport of a wide variety of substrates 
across membranes (Dean and Allikmets, 1995). Phyloge- 
netic analysis further divides human ABC transporters 
into seven subfamilies: ABCA (ABC1 subfamily), ABCB 
(MDR/TAP subfamily), ABCC (CFTR/MRP subfamily), 
ABCD (ALD subfamily), ABCE (OABP subfamily), 
ABCF (GCN20 subfamily), and ABCG (white subfamily) 
(Allikmets et al., 1996; http://www.gene.ucl.ac.uk/users/hester/abc.html).
Most ABC proteins from eukaryotes
encode so-called 'full transporters', each consisting of two
ATP-binding domains and two transmembrane domains
(Hyde et al., 1990).

The human ABCC subfamily currently has ten identified 
members (ABCC1-10), seven from the multidrug resistance-like
(MRP) subgroup, two from the sulfonylurea
receptor (SUR) subgroup, and the CFTR gene. MRP-like
proteins are organic anion transporters, i.e. they transport
anionic drugs, exemplified by methotrexate (MTX), as well
as neutral drugs conjugated to acidic ligands, such as
glutathione (GSH), glucuronate, or sulfate, and play a
role in resistance to nucleoside analogs (Cui et al., 1999; 
Kool et al., 1999; Schuetz et al., 1999; Wijnholds et al., 
2000). Genetic variation in some ABCC subfamily 
members is associated with human inherited diseases, 
including cystic fibrosis (CFTR/ABCC7) (Riordan et al., 
1989), Dubin-Johnson syndrome (ABCC2) (Wada et al., 
1998), pseudoxanthoma elasticum (ABCC6) (Bergen et 
al., 2000; Le Saux et al., 2000) and familial persistent 
hyperinsulinemic hypoglycemia of infancy (ABCC8) 
(Thomas et al., 1995).

Paroxysmal kinesigenic choreoathetosis (PKC; MIM# 
128200), the most frequent type of paroxysmal dyskinesia, 
is a disorder characterized by recurrent, frequent attacks of 
involuntary movements and postures, including chorea and 
dystonia, induced by sudden voluntary movements, stress, 
or excitement (for a detailed description of clinical and 
genetic features, see Swoboda et al., 2000). In most families 
it is inherited as an autosomal dominant trait with incom- 
plete penetrance. The gene locus has been mapped to human 
chromosome 16q11-q12 (Tomita et al., 1999; Bennett et al.,
2000). 

We initiated a long-term project of cloning new human 
ABC transporters and linking them to various disease 
phenotypes (Allikmets et al., 1996, 1997, 1999). As one 
of the results of this project, we present here two new 
members of the human ABCC subfamily, ABCC11 and
ABCC12. 



2. Materials and methods 

2.1. Sequence analysis

Searches of the GenBank HTGS database were performed 
with the TBLASTN and TBLASTP programs on the NCBI 
file server (http://www.ncbi.nlm.nih.gov) with the known 
ABC transporter nucleotide and protein sequences as 
queries. Potential transmembrane spanning segments were 
predicted with the TMAP program (http://bioweb.pasteur.fr/seqanal/interfaces/tmap.html).
Amino acid alignments were generated with the PILEUP program included
in the Genetics Computer Group (GCG) Package. The
GRAIL and GeneScan programs on Genome Analysis Pipeline I
(http://compbio.ornl.gov/GP/) were utilized to predict
genomic structures of the new genes.

2.2. cDNA cloning and determining the genomic structure 

Primers were designed from expressed sequence tag 
(EST) clone sequences and from predicted cDNA sequences
from 5' and 3' regions of genes. cDNA sequences of
ABCC11 and ABCC12 were confirmed by PCR amplification
of testis or liver cDNA (Clontech). Sequencing was
performed on the ABI 377 sequencer according to the 
manufacturer's protocols (Perkin Elmer). Positions of 
introns were determined by comparison between genomic 
(BAC AC007600) and cDNA sequences. The sequences of
the ABCC11 and ABCC12 cDNAs were deposited with the
GenBank Database under the accession numbers AY040219
and AY040220, respectively.

2.3. Physical mapping 

The chromosomal localization of the human ABCC11 and
ABCC12 genes was determined by mapping on the GeneBridge4
radiation hybrid panel (Research Genetics), according
to the manufacturer's protocol.

2.4. Expression analysis 

Expression profiles of the human ABCC11 and ABCC12
genes were determined by PCR on human Multiple Tissue
cDNA (MTC™, Clontech) Panels I and II according to the
manufacturer's instructions. Each MTC panel contains
normalized, first-strand cDNA from eight human tissues/
cells: (I) heart, whole brain, placenta, lung, liver, skeletal
muscle, kidney and pancreas; (II) spleen, thymus, prostate,
testis, ovary, small intestine, colon and peripheral blood
leukocyte. The following primer pairs amplified specific
gene products: ABCC11: forward 5'-AGA ATG GCT
GTG AAG GCT CAG CAT C-3', reverse 5'-GTT CCT
CTC CAG CTC CAG TGC-3'; ABCC12: forward 5'-GGT
GAC AGA CAA GCG AGT TCA GAC AAT G-3',
reverse 5'-CTT TGC TCC TCT GGG CCA GTG-3'.
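
As a reader's aid, basic properties of the primers listed in Section 2.4 can be tabulated with a few lines of code. The sketch below computes length, GC fraction, and a rough melting-temperature estimate for the ABCC11 forward primer. The Wallace rule used here (2 °C per A/T, 4 °C per G/C) is a common rule of thumb intended for short oligonucleotides; it is an assumption made for illustration and not a method described by the authors, and the function names are hypothetical.

```python
def clean(primer):
    """Strip the triplet spacing used in the text and upper-case the bases."""
    return primer.replace(" ", "").upper()

def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    seq = clean(primer)
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(primer):
    """Rule-of-thumb melting temperature: 2 per A/T plus 4 per G/C (degrees C)."""
    seq = clean(primer)
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# ABCC11 forward primer as printed in Section 2.4.
abcc11_forward = "AGA ATG GCT GTG AAG GCT CAG CAT C"
length = len(clean(abcc11_forward))
print(length, gc_fraction(abcc11_forward), wallace_tm(abcc11_forward))
```

For a 25-mer such as this one the Wallace estimate tends to overstate the melting temperature, so the value is best read as a relative figure for comparing primers rather than as an annealing temperature.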

2.5. Cell lines 

The human erythroleukemia K562 cells were obtained
from the American Type Culture Collection (Rockville,
MD) and were cultured in RPMI-1640 medium supplemented
with 10% fetal calf serum and 2 mM L-glutamine. The
9-(2-phosphonylmethoxyethyl)adenine (PMEA)-resistant
cells, K562/PMEA, were derived as described earlier
(Hatse et al., 1996), and were kindly provided by Dr Jan
Balzarini (Rega Institute for Medical Research, Katholieke
Universiteit, Leuven, Belgium). The T-lymphoblast cell
line CEM and the (-)2',3'-dideoxy-3'-thiacytidine (3TC)-resistant
CEM-3TC cells were provided by Dr Guido Antonelli
(Department of Experimental Medicine and Pathology,
Virology Section, University "La Sapienza", Rome, Italy).
The selection of the 3TC-resistant cell line and its phenotypic
properties will be described in detail in an upcoming
publication. Another previously described pair of cell lines,
CEMss and CEM-r1, were acquired from Dr Arnold Fridland
(Robbins et al., 1995). CEM-r1 is highly resistant to
PMEA due to an overexpression of ABCC4 (Schuetz et al.,
1999). Total RNA from these six cell lines (three pairs of
wild-type and resistant cell lines) was isolated with TRIZOL
(GIBCO BRL), and RT-PCR was performed at varying
cycle numbers with primers described in Section 2.4. The
PCR products were subcloned into the pCR 2.1 vector and
verified by direct sequencing.

Fig. 1. Amino acid alignment of ABCC11, ABCC12, and ABCC5 proteins. Identical amino acids are shaded and gaps are indicated by periods. Walker A and B
motifs and the ABC transporter family signature sequence C are underlined and labeled with respective letters. Potential transmembrane spanning segments are
given in bold type.

2.6. Phylogenetic analysis 

Complete protein sequences were aligned with the
CLUSTALW program (Thompson et al., 1994). The resulting
multiple alignment was analyzed with the program
NJBOOT by N. Takezaki (pers. commun.) implementing
the neighbor-joining tree-making algorithm (Saitou and
Nei, 1987). The Poisson correction for multiple hits (Zuckerkandl
and Pauling, 1965) was used as the distance
measure between sequences for generating a phylogenetic
tree.

Fig. 2. Splicing pattern comparison of the ABCC11 and ABCC12 genes. Clear boxes represent exons and vertical lines define splice sites. Exon numbers for
each gene are shown both above and below the drawing. Filled boxes indicate the exons encoding ABC domains. Regions in which the two genes show
identical splicing patterns are indicated by dashed lines.
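
The Poisson correction named in Section 2.6 converts the observed proportion p of differing amino acid sites into an estimated number of substitutions per site, d = -ln(1 - p). A minimal sketch follows; the function name is hypothetical, and excluding any column that contains a gap is an assumption made here, since the text does not say how NJBOOT treated gaps.

```python
import math

def poisson_distance(row_a, row_b, gap="-"):
    """Poisson-corrected distance d = -ln(1 - p) between two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Assumption: columns with a gap in either row are ignored.
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not compared:
        raise ValueError("no gap-free columns to compare")
    p = sum(1 for a, b in compared if a != b) / len(compared)
    if p >= 1.0:
        raise ValueError("all compared sites differ; distance is undefined")
    return -math.log(1.0 - p)

# Toy rows (not real ABCC sequences): half of the compared sites differ.
d = poisson_distance("AAAA", "AABB")
print(d)
```

The correction grows faster than p itself, which is how it accounts for multiple substitutions at the same site between distantly related proteins.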



3. Results and discussion 



The ABCC4 and ABCC5 proteins confer resistance to 
nucleotide analogs, including PMEA and purine base 
analogs (Schuetz et al., 1999; Wijnholds et al., 2000).
ABCC1, ABCC2 and ABCC3 transport drugs conjugated
to GSH, glucuronate, sulfate and other organic anions,
such as MTX (Cui et al., 1999; Kool et al., 1999; Wijnholds
et al., 2000). Since structurally related ABC proteins often
transport similar substrates across cell membranes, it would
be reasonable to suggest that ABCC11 and ABCC12 could
share functional similarities with ABCC4 and/or ABCC5. 



3.1. Cloning and genomic structure of ABCC11 and
ABCC12

Two new human ABC transporter gene sequences were 
detected on the bacterial artificial chromosome (BAC) clone
#AC007600 from the GenBank HTGS database. cDNA 
sequencing, genomic structure prediction programs, and 
computer searches determined the sequence and genomic 
structure of both new genes belonging to the ABCC 
(MRP) subfamily. Only the combination of all these meth- 
ods allowed for the correct assembly of these genes which 
are closely related and highly conserved in evolution. 

The human ABCC11 and ABCC12 genes each consist of 29
exons. Exon sizes range from 72 to 252 bp for ABCC11 and
from 73 to 279 bp for ABCC12. All exons were flanked by
GT and AG dinucleotides consistent with the consensus
sequences for splice junctions in eukaryotic genes (Table
1). Of the 28 introns in ABCC11, 18 are class 0 (where the
splice occurs between codons), four are class 1 (where the
codon is interrupted between the first and the second nucleotide),
and six are class 2 (where the splice occurs between
the second and the third nucleotide of the codon). For the
ABCC12 gene these numbers are 16, six and six, respectively.
The ABCC11 gene encodes a protein of 1382 amino
acids, and ABCC12 a protein of 1359 amino acids (Fig. 1).
Topology predictions based on hydropathy profiles and
comparison with other known ABC transporters suggest
that both encoded proteins are full ABC transporters
containing two ATP-binding domains (including Walker
A and B domains, and signature motifs) and two transmembrane
domains (Fig. 1). The amino acid sequence of
ABCC11 is 40% identical to the human ABCC5 protein,
33% identical to human ABCC4 and 32% identical to
ABCC2 and ABCC3 proteins. The ABCC12 protein is
even more closely related to ABCC5 (42% identity at the
protein level; Fig. 1).
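
The intron classes described above follow from where each splice site falls relative to the reading frame: the class equals the number of coding nucleotides upstream of the intron, modulo 3. The sketch below illustrates this with made-up exon lengths (the real values would come from Table 1, which is not reproduced here) and checks that the reported class counts add up to 28 introns for a 29-exon gene. The function name is hypothetical.

```python
from collections import Counter

def intron_phases(coding_exon_lengths):
    """Class (0, 1 or 2) of each intron, from coding lengths of consecutive exons."""
    phases = []
    coding_so_far = 0
    # One intron follows every exon except the last.
    for length in coding_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases

# Hypothetical coding lengths for a four-exon gene (not taken from Table 1).
example = intron_phases([90, 100, 62, 48])

# Class counts as reported in the text for the two genes.
reported = {
    "ABCC11": Counter({0: 18, 1: 4, 2: 6}),
    "ABCC12": Counter({0: 16, 1: 6, 2: 6}),
}
totals = {gene: sum(counts.values()) for gene, counts in reported.items()}
print(example, totals)
```

Both totals come to 28, which agrees with 29 exons per gene.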

The splicing pattern of the two new genes is very similar, 
especially towards the 3' end (Fig. 2), suggesting a close
evolutionary relationship between these ABC transporters.
The ABCC11 and ABCC12 proteins, as well as ABCC4 and
ABCC5, are smaller than another well-known member of 
the subgroup, ABCC1 (MRP1), appearing to lack the extra 
N-terminal domain (Fig. 1) (Borst et al., 2000). It has been 
shown, however, that the extra N-terminal part of ABCC1 is 
not required for the transport function (Bakos et al., 1998). 



3.2. Expression of ABCC11 and ABCC12 in human tissues
and nucleoside-resistant cell lines

The expression patterns for the ABCC11 and ABCC12
genes were examined by PCR on MTC panels (Clontech)
with gene-specific primers resulting in about 500 bp PCR
fragments (Fig. 3B). ABCC11 was expressed in all tissues
except kidney, spleen, and colon. The ABCC12 transcript
was detected, at much lower levels, only in testes, ovary,
and prostate. The size for both transcripts was determined at
approximately 5000 bp on MTN blots (Clontech, data not
shown). The primers used in expression studies amplified
the ABCC11 cDNA from exon 7 to exon 10, resulting in a
527 bp PCR fragment (Fig. 3B). In the case of lung, and
occasionally some other tissues (data not shown), a smaller
(419 bp) fragment was detected (Fig. 3B). Direct sequencing
of the PCR product determined that the shorter PCR
product lacked exon 9 of the ABCC11 gene. Since these
results were confirmed in repeated experiments, frequent
skipping of ABCC11 exon 9 may occur in vivo. Exon skipping
and alternative splicing events have been well documented
for several ABC genes (Rickers et al., 1994;
Bellincampi et al., 2001).

Fig. 3. Chromosomal localization and expression analysis of the ABCC11
and ABCC12 genes. (A) Human ABCC11 and ABCC12 genes, flanked by
markers D16S3093 and D16S409, are separated by ~200 kb, and organized
in a head-to-tail fashion, with their 5' ends facing the centromere. Loci for
ICCA, PKC, and their overlap, are defined by brackets. (B) Expression
analysis of the human ABCC11 and ABCC12 genes by PCR on MTC
panels. Lanes 1-16 represent cDNA from heart, brain, placenta, lung,
liver, muscle, kidney, pancreas, spleen, thymus, testis, ovary, intestine,
colon, leukocyte, and prostate, respectively. N, negative control; M, marker
lane (1 kb Plus DNA Ladder).
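
The two fragment sizes reported in Section 3.2 (527 bp and 419 bp) allow a quick check of whether the skipped segment would preserve the reading frame. The sketch below assumes that the entire size difference is due to the missing exon 9 and that the sizes are exact; the inference about the reading frame is an illustration and is not a statement made in the text. The function name is hypothetical.

```python
def skipped_segment(full_bp, short_bp):
    """Return (length of skipped segment, in-frame?, whole codons removed)."""
    difference = full_bp - short_bp
    in_frame = difference % 3 == 0
    return difference, in_frame, difference // 3

# Fragment sizes as reported for the exon 7 to exon 10 amplicon.
length, in_frame, codons = skipped_segment(527, 419)
print(length, in_frame, codons)
```

Under these assumptions the skipped segment is 108 bp, a multiple of three, so its removal would delete 36 codons without shifting the downstream reading frame.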

Systematic analysis of the tissue source of the ABCC12 
ESTs from the public dbEST and the proprietary Incyte 
LifeSeq Gold databases indicates that 11/18 of the matching
sequences are derived from various CNS origins, and the 
rest are from testis (three clones) and immune system (four 
clones). Similar analysis for the ABCC11 gene resulted in 29 
ESTs, with the majority being derived from breast tumor 
tissue (17). The others were from prostate (five clones), 
testis (three), CNS (two), and colon (two). Certain discre- 
pancies between the two expression profiling methods are 
often observed for low abundance transcripts, which have 
high tissue distribution selectivity. 

Since the new genes show extensive structural similarity 
to ABCC5 (and to a certain extent, ABCC4), we checked 
their expression in three pairs of cell lines, K562 and K562-PMEA,
CEMss and CEM-r1, and CEM and CEM-3TC. The
K562-PMEA and CEM-r1 lines have been selected for
resistance to PMEA, and the CEM-3TC for resistance to
the cytidine nucleoside analog, 3TC. No difference was
observed in expression levels of ABCC11 between the
parental and PMEA-resistant cell lines. In contrast, the
CEM-3TC cell line revealed a reproducible two- to three-fold
increase in the expression of ABCC11, when compared
to the parental line CEM (data not shown). This is a potentially
interesting finding when one considers the close
evolutionary relationship of ABCC11 and ABCC5 (Figs. 1
and 4), and that a recent study by Borst and colleagues
(Wijnholds et al., 2000) has demonstrated selective nucleotide
analog transport by ABCC5. In addition, since the
efflux-resistant phenotype of CEM-3TC can be explained
only in part by ABCC4 overexpression (J.D.S., unpublished
data), the higher expression of ABCC11 in these cells
warrants further investigation.

3.3. Radiation hybrid mapping

Radiation hybrid mapping placed ABCC11 and ABCC12
in the centromeric region of human chromosome 16, flanked
by markers D16S3093 and D16S409 (Fig. 3A). The region
encompasses 5.4 cM, or 132.5 cR, and could not be
narrowed down further due to lack of recombination and/
or mapped polymorphic markers in this region. Both genes
are most likely localized on chromosome 16q12.1, since
they map much closer to the 16q marker D16S409 (13.24
cR) than the 16p marker D16S3093 (119.40 cR) (Fig. 3A).
The location of the new genes was confirmed also by LANL
BAC mapping data, where the BAC clone #AC007600 was
mapped to 16q12.1 by the STS marker s9Bl
(http://www.jgi.doe.gov/JGI_home.html). ABCC11 and ABCC12
are located tandemly, separated by about 200 kb, with
their 5' ends facing towards the centromere (Fig. 3). Two
more ABCC subfamily genes, ABCC1 and ABCC6, have
been mapped to the short arm of the same chromosome, at
16p13.1 (Cole et al., 1992; Allikmets et al., 1996). The 3'
ends of ABCC1 and ABCC6 are only about 9 kb apart from
each other so the genes face opposite directions (Cai et al.,
2000).

Fig. 4. Phylogenetic relationship of genes in the ABCC subfamily. Complete protein sequences of all members of the ABCC subfamily were aligned with the
CLUSTALW program. The distance measure is given in substitutions per amino acid.

3.4. Phylogenetic analyses 

Phylogenetic analyses of the ABCC subfamily proteins 
clearly demonstrate a relatively recent duplication of the 
ABCC11 and ABCC12 genes (Fig. 4). The resulting neighbor-joining
tree shows with maximum confidence (100%
level of bootstrap support) a close evolutionary relationship
of the ABCC11/ABCC12 cluster with the ABCC5 gene (Fig.
4). In addition, the analysis of the tree suggests a recent
duplication of the ABCC8 and ABCC9 genes, while
ABCC10 seems to be one of the first genes to separate
from the common ancestor. ABCC1, ABCC2, ABCC3, and 
ABCC6 genes constitute a well-defined sub-cluster, while 
the ABCC4 and CFTR (ABCC7) genes form another reliable 
subset despite apparent early divergence. 

3.5. ABCC11 and ABCC12 as candidate genes for PKC 

The locus for PKC has been assigned to 16p11.2-q12.1,
between markers D16S3093 and D16S416 (Tomita et al.,
1999; Bennett et al., 2000) (Fig. 3A). An overlapping locus
has been predicted to contain the gene for infantile convulsions
with paroxysmal choreoathetosis (ICCA; Lee et al.,
1998). Expression analysis by PCR and by EST database 
mining suggests that the two genes are expressed in tissues 
(CNS, muscle) potentially involved in the etiology of PKC. 
In summary, chromosomal localization, potential function, 
and expression profiles make both genes promising candi- 
dates for PKC/ICCA. Preliminary analysis of the ABCC11 
gene has identified several single nucleotide polymorph- 
isms, including an amino acid-changing variant (56G > A, 
R19H) in the first exon. Complete screening of the ABCC11 
and ABCC12 genes for genetic variation in families segre- 
gating PKC is currently under way. 
Two new human ABC transporters, ABCC11 and
ABCC12, were cloned from a cDNA library of human
adult liver. ABCC11 and ABCC12 genes consist of 30
and 29 exons, respectively, and they are tandemly located
in a tail-to-head orientation on human chromosome
16q12.1. The predicted amino acid sequences of
both gene products show a high similarity with 
ABCC5. The transcripts of ABCC11 and ABCC12 genes 
were detected by PCR in various adult human tissues, 
including liver, lung, and kidney, and also in several 
fetal tissues. By searching cDNA libraries from vari- 
ous human tissues, we have identified alternative 
splicing variants of ABCC11 and ABCC12 genes at sig- 
nificantly high frequencies. One splice variant lacking 
the exon 28 corresponded to about 25% of total 
ABCC11 gene transcripts. Furthermore, four splicing 
variants encoding putatively short peptides were pre- 
dominant in ABCC12 gene transcripts. Those splicing 
variants may represent diverse biological functions of 

these ABC transporter genes. © 2001 Academic Press 

Key Words: ABC transporter; ABCC11; ABCC12; ge- 
netic polymorphism; alternative splicing; human chro- 
mosome 16. 



The ATP-binding cassette (ABC) transporters form
one of the largest protein families and play a biologically
important role as membrane transporters or ion
channel modulators (1, 2). Until now more than 48
human ABC-transporter genes have been identified
and sequenced (3, 4). Based on the arrangement of
molecular structure components, i.e., the nucleotide
binding domain and the topology of transmembrane
domains, human ABC transporters are classified into
seven different gene families (A to G) (2-4). Mutations
of human ABC transporter genes have been reported to
cause certain genetic diseases, such as Tangier disease
(5-7), cystic fibrosis (8), Dubin-Johnson syndrome
(9), Stargardt disease (10), and sitosterolemia (11).

The cDNA sequences of ABCC11 and its transcript variant A as
well as ABCC12 variants A, B, C and D have been registered in
GenBank under the Accession Nos. AF367202, AF411579,
AF395908, AF395909, AF411577, and AF411578, respectively.

Abbreviations used: ABC, ATP-binding cassette; MRP, multidrug
resistance-associated protein; PCR, polymerase chain reaction; GS-X
pump, ATP-dependent glutathione S-conjugate export pump.

1 To whom correspondence and reprint requests should be addressed.
Fax: +81-45-924-5838. E-mail: tishikaw@bio.titech.ac.jp.

The ABCC gene family (according to the new nomen- 
clature of human ABC transporter genes) comprises 
the members of multidrug resistance-associated pro- 
teins (MRP) (12), sulfonylurea receptors (SUR) (13), 
and cystic fibrosis transmembrane conductance regu- 
lator (CFTR) (8). MRP1 (ABCC1 according to new no- 
menclature for human ABC transporter genes) was 
first identified by molecular cloning from human 
multidrug-resistant lung cancer cells (14). MRP1 en- 
codes one of the previously characterized GS-X pumps 
(15) that transport leukotriene C4 (16) and drugs either
conjugated with glutathione (GSH), glucuronide or sul- 
fate (17). In addition, MRP1 reportedly transports 
some anticancer drugs in an unmodified form together 
with GSH (18, 19). After the discovery of the MRP1 
gene, six MRP1 homologues have been identified. At 
present the human MRP subfamily consists of at least 
seven members (MRP1, MRP2/cMOAT, MRP3, MRP4, 
MRP5, MRP6, and MRP7) (2, 3, 12, 20, 21) and exhibits 
a wide spectrum of biological functions. Accumulating 
evidence shows that ABC transporters of the MRP 
subfamily are involved in transport of drugs as well as 
endogenous substances (12, 16-19, 22, 23). 

The draft sequence of the human genome has recently
been published (24, 25), and more than 50 human
ABC transporter genes have been anticipated to exist
in the human genome (4). However, at present, because
of the difficulty in the precise prediction of exon-intron 
boundaries using currently available software pro- 
grams, actual cloning and sequencing of cDNA is still a 
critical step for our understanding of the molecular 
structure and function of novel ABC transporters. We
have recently discovered two novel ABC transporters,
i.e., human ABCC11 and ABCC12, that belong to the
ABCC gene family and are located on chromosome
16q12.1. In the present study, we have analyzed multiple
splicing variants transcribed from the ABCC11 and
ABCC12 genes and herein demonstrate the gene features
and expression profiles of these two ABC transporter
genes in human organs.

MATERIALS AND METHODS 

Cloning of cDNA encoding human ABCC11 and ABCC12 and their
splicing variants. The draft sequence of the human chromosome 16
(GenBank Accession No. AC007600) was analyzed using the GENSCAN
program (http://genes.mit.edu/GENSCAN.html) to predict exons.
EST clones were extracted from the currently available EST
database to find partial sequences of ABCC11 and ABCC12.

To clone the full-length ABCC11 cDNA and its splicing variant, the
following three sets of PCR primers were designed: the 5'-part (C11-1
forward primer: 5'-ATGGC1TCGCGCTGCTCTCT-3' and C11-1 backward
primer: 5'-CCTCAGATGTGTGATGCCGAGCCTT-3'), the middle
part (C11-2 forward primer: 5'-GGTATTCATGACAAGAATGG-3', and
C11-2 backward primer: 5'-GACGATCAGCACCACGAAGA-3'), and
the 3'-part (C11-3 forward primer: 5'-CCTOAGTTGGAGGGTCTAC-3',
C11-3 backward primer: 5'-AAGTAGCCTATTCCAGGG1TT-3').
PCR was performed using human adult liver cDNA (Clontech, Palo
Alto, CA) and the Ex Taq polymerase (Takara, Japan), where the PCR
consisted of 30 cycles of 94°C for 30 s, 60°C for 30 s, and 72°C for 2 min.
In addition, the 3'-portion of noncoding region was cloned by 3'-rapid
amplification of cDNA ends (3'RACE) using human liver Marathon-Ready
cDNA (Clontech) and two primers (first primer: 5'-GCCAGGGCTGTGCTTCGCAAC-3'
and nested primer: 5'-CAGGGCTGCACCGTGCTCGT-3')
under the PCR conditions of 30 cycles of 94°C for 1
min and 68°C for 2 min.

To clone four splicing variants of ABCC12 cDNA in a similar
manner, six sets of PCR primers were designed: the 5'-parts (C12-1
forward primer: 5'-ATCAGGATGGTGGGTGAAGG-3', C12-1 backward
primer: 5'-CTGGCTTCATGCTCCCATGTC-3' and C12-2 forward
primer: 5'-GGTGGGTGAAGGACCCTA-3', C12-2 backward
primer: 5'-CAGAACCGATTTGAGGCTGTCACT-3'), the middle
parts (C12-3 forward primer: 5'-TGAAGCCAGCAGGAAAGTACC-3',
C12-3 backward primer: 5'-CTGCAGAAAGTTCTCTGCGT-3'
and C12-4 forward primer: 5'-CTCCTCTCTGCATGACACGG-3',
C12-4 backward primer: 5'-CACACAAAGCAGCTGACGTTC-3'), and the
3'-parts (C12-5 forward primer: 5'-GTAAGGTACAACTTGGATCCCT-3',
C12-5 backward primer: 5'-TGCTGCTAGTAACATCGCAA-3' and
C12-6 forward primer: 5'-CACCGCCTCTATGGACTCCAAGACTG-3',
C12-6 backward primer: 5'-CGCTACAAATCTGTGTCATTACCAC-3').
PCR was performed using human adult liver, pancreas and testis
cDNA (Clontech). The PCR consisted of 35 cycles of 94°C for 30 s, 60°C
for 30 s, and 72°C for 2 min.
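Primer sequences like those listed above are usually sanity-checked before use. The following is a minimal Python sketch of such a check and is not part of the original methods: the helper names `validate_primer`, `reverse_complement` and `gc_fraction` are hypothetical, and the example input is the C11-2 backward primer as printed in the text.

```python
# Illustrative primer checks; all function names are assumptions for this sketch.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def validate_primer(seq: str) -> str:
    """Return the sequence upper-cased without spaces or hyphens.

    Raises ValueError if anything other than A, C, G or T remains.
    """
    cleaned = seq.replace(" ", "").replace("-", "").upper()
    if not cleaned:
        raise ValueError("empty primer sequence")
    bad = sorted(set(cleaned) - set(_COMPLEMENT))
    if bad:
        raise ValueError(f"unexpected characters in primer: {bad}")
    return cleaned


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA primer, written 5' to 3'."""
    return "".join(_COMPLEMENT[base] for base in reversed(validate_primer(seq)))


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the primer."""
    cleaned = validate_primer(seq)
    return (cleaned.count("G") + cleaned.count("C")) / len(cleaned)


c11_2_backward = "GACGATCAGCACCACGAAGA"  # C11-2 backward primer from the text
print(reverse_complement(c11_2_backward))
print(round(gc_fraction(c11_2_backward), 2))
```

The validation step rejects any character outside A, C, G and T, which is a cheap way to catch transcription errors in a primer list.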

The sequences of PCR products were analyzed by automated DNA
sequencing (TOYOBO Gene Analysis, Japan). The whole cDNA
sequences of ABCC11 and ABCC12 as well as their splicing variants
were determined by assembling the partial sequences thus obtained.
The cDNA sequences of ABCC11 and its splicing variant A as well as
ABCC12 splicing variants A, B, C, and D have been deposited in
GenBank under Accession Nos. AF367202, AF411579, AF395908,
AF395909, AF411577, and AF411578, respectively.

Detection of ABCC11 and ABCC12 transcripts in human normal
tissues and cancer cell lines. Transcripts of the ABCC11 and ABCC12
genes were detected by PCR; human cDNAs from normal
tissues and cancer cell lines were purchased from Clontech. The PCR
primers to detect ABCC11 and ABCC12 were as follows: C11 forward
primer: 5'-TCTuOGACCTTCTTGTTTGG-3', C11 backward primer:
5'-TCAGTACAGCATTTGCAACACTT-3' and C12 forward primer:
5'-CACCGCCTCTATGGACTCCAAGACTG-3', C12 backward
primer: 5'-TCAATCTCAGGCACTGGGGTGGT-3'. The PCR consisted
of 38 cycles of 95°C for 30 s, 58°C for 30 s, and 72°C for 30 s and
was followed by reaction at 72°C for 2 min.
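As a rough planning aid, the programmed hold time of the cycling conditions given in this section can be added up. The sketch below is an illustration only: `program_seconds` is a hypothetical helper, and it ignores ramp times, initial denaturation and any hold steps that the text does not state.

```python
# Minimal sketch: total programmed hold time of a PCR protocol (ramp times ignored).
def program_seconds(cycles: int, step_seconds: list, final_extension: int = 0) -> int:
    """Sum of all hold steps over all cycles plus an optional final extension."""
    if cycles < 0 or any(s < 0 for s in step_seconds) or final_extension < 0:
        raise ValueError("cycle count and durations must be non-negative")
    return cycles * sum(step_seconds) + final_extension


# Transcript detection: 38 cycles of 30 s / 30 s / 30 s, then 2 min at 72°C.
detection = program_seconds(38, [30, 30, 30], final_extension=120)

# ABCC11 cloning: 30 cycles of 30 s / 30 s / 2 min.
cloning = program_seconds(30, [30, 30, 120])

print(detection / 60, "min")
print(cloning / 60, "min")
```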

RESULTS AND DISCUSSION 

ABCC11 and ABCC12 Genes Located on Human
Chromosome 16q12

Two new ABCC transporters, named ABCC11 and
ABCC12, were identified by database search on human
chromosome 16 working draft (GenBank Accession No.
AC007600) using the BLASTN program. In the present
study, we have cloned cDNAs of these two new ABC
transporters and their splicing variants to analyze the
genetic polymorphism and expression profiles.

ABCC11 and ABCC12 genes are tandemly located
on human chromosome 16q12.1 in a tail-to-head
orientation with a separation distance of about 20 kb
(Fig. 1). The ABCC11 gene spans ~68 kb and
consists of 30 exons, whereas the ABCC12
gene spans ~63 kb and consists of 29 exons.
The cDNAs of both ABCC11 and ABCC12 had a
Kozak consensus initiation sequence for translation
(26) around the first ATG region, namely,
5'-CTGAAA[ATG]A-3' for ABCC11 and
5'-ATCAGG[ATG]G-3' for ABCC12. The amino acid sequence
deduced from the cDNA sequence with the GENSCAN
program revealed that ABCC11 and ABCC12 cDNAs
contain single open reading frames encoding proteins
consisting of 1382 and 1359 amino acid residues,
respectively. ABCC11 and ABCC12 proteins have two
sets of Walker A and Walker B motifs as well as two
ABC signature sequences, so-called "C motifs," within
the deduced protein. In terms of the amino acid
sequence, the identity of ABCC11 with human ABCC1, 2,
3, 4, 5, 6, 7, 8, 9, and 10 was 30.7, 30.8, 30.9, 32.9, 40.1,
29.9, 26.0, 27.8, 27.9, and 29.3%, respectively. Likewise,
the identity of ABCC12 with human ABCC1, 2, 3,
4, 5, 6, 7, 8, 9, and 10 was 31.1, 30.4, 30.0, 32.8, 43.6,
28.8, 26.9, 27.9, 27.8, and 29.0%, respectively. The
identity between ABCC11 and ABCC12 was 47.4%.
Based on the phylogenetic relationship, ABCC11 and
ABCC12 are suggested to comprise a new subgroup
with a close relation to ABCC5, which reportedly
transports several organic anions, including nucleotide
analogues and cyclic nucleotides (23, 27, 28).
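Percent identity values such as those quoted above depend on how the alignment is scored. The text does not state which alignment program or denominator was used, so the following Python sketch is only one common convention (identical residues divided by aligned, non-gap columns); the function name and the toy sequences are hypothetical.

```python
# Minimal sketch: percent identity over the ungapped columns of a pairwise alignment.
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical residues as a percentage of columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gap columns are excluded under this convention
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment (invented residues, not ABCC sequences): 4 of 5 compared columns match.
print(percent_identity("MKT-LV", "MRTALV"))
```

Other conventions divide by the full alignment length or by the shorter sequence, which would give slightly different percentages for the same alignment.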

Splicing Variants of Human ABCC11 and ABCC12

Figure 2 shows splicing variants of ABCC11 and
ABCC12 cloned in this study. The cDNA of ABCC11
variant A consists of 4476 nucleotides with 29 exons;
however, exon 28 is entirely deleted. This intron
splicing follows the conventional GT-AG rule. The
cDNA of this variant encodes a protein consisting of
1344 amino acid residues. Based on hydropathy analysis,
it is suggested that variant A has 12
membrane-spanning domains like ABCC11 (Fig. 3,
left). However, due to the deletion of exon 28, the
variant A protein lacks 38 amino acid residues in the
second ATP-binding cassette.

FIG. 1. The genomic structures of ABCC11 and ABCC12 genes on human chromosome 16. The cytogenetic location of the ABCC11
and ABCC12 genes as well as the structures of exons and introns were analyzed by BLAST search on the Human Genome Project Working
Draft (http://genome.cse.ucsc.edu/).

In addition to ABCC12, there are four splicing variants
of ABCC12, namely variants A, B, C, and D,
consisting of 4034, 3886, 4127 and 4048 nucleotides,
respectively. These splicing variants were identified in
cDNA libraries from various tissues, such as adult
liver, pancreas, testis, and fetal thymus. Figure 2A
shows the exon alignments of the splicing variants.
The 5'-half (exons 1 to 19) of cDNAs of these variants is
identical to that of the ABCC12 cDNA. However, in the
3'-half, variants A and D lack exon 26, whereas
variant B lacks exon 20. Furthermore, both variants
A and B lack 14 bp (GTAGGTACAGTAAG) in
exon 25, as indicated by 25A in Figs. 2A and 2B. The
alternative splicing causing the 14-bp deletion may be
related to the repeated GTAG sequence at the boundary
between exon 25 and the following intron (Fig.
2B). Importantly, all of these variants have an extra
30-bp sequence at the 5'-end of exon 22 (Fig. 2A),
where a putative stop codon for translation, i.e., TAG
or TAA, was incorporated in their cDNAs (Fig. 2B).
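The reading-frame consequences of the events described above follow from simple arithmetic: a 14-bp deletion is not a multiple of three and shifts the downstream frame, whereas a 30-bp insertion preserves the frame but can still introduce a stop codon. The Python sketch below illustrates this; the helper names are hypothetical and the toy coding sequence is invented for the example, not taken from ABCC12.

```python
# Minimal sketch: frame effect of an insertion/deletion and first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame(length_bp: int) -> bool:
    """True if an insertion or deletion of this length keeps the reading frame."""
    return length_bp % 3 == 0


def first_in_frame_stop(cds: str):
    """Index of the first in-frame stop codon (frame 0), or None if there is none."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


print(is_in_frame(14))   # 14-bp deletion in exon 25 (variants A and B)
print(is_in_frame(30))   # extra 30-bp sequence at the 5'-end of exon 22
print(is_in_frame(38 * 3))  # loss of 38 residues in ABCC11 variant A

toy_cds = "ATGAAATAGCCC"  # hypothetical sequence: ATG AAA TAG CCC
print(first_in_frame_stop(toy_cds))
```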

Figure 3 shows the putative protein structures of
these splicing variants. Variants A, B and D cDNA
contain a single open reading frame encoding 1009,
935, and 1009 amino acid residue proteins, respectively.
Because of the above-mentioned stop codon,
these variants may have only eight to nine transmembrane
domains and lack the C-terminal domain with
the second ATP-binding cassette (Fig. 3, right).
Interestingly, the cDNA of variant C is suggested to
have two open reading frames encoding peptides
consisting of 1009 and 331 amino acid residues, since the
Kozak consensus sequence resides around the first
ATG regions of these two peptide-coding sequences,
i.e., 5'-ATCAGG[ATG]G-3' for the first peptide and
5'-CTGAGA[ATG]G-3' for the second peptide. Hydropathy
analysis suggests that, if translated, the first and
the second peptides may have nine and three putative
transmembrane domains, respectively (Fig. 3). In the
case of ABCC8 (SUR1), coexpression of two parts of the
protein divided at Pro1042 between transmembrane
domains reportedly restored glibenclamide-binding
activity (29). Furthermore, it was also reported that
small carboxyl-terminal deletions of up to 23 amino
acids left the functional activity of ABCB1
(MDR1/P-glycoprotein) intact (30). Therefore, it is of interest to know
whether some of these splicing variants of ABCC12
have biological functions. Expression of those
splicing variants and their function remain to be
elucidated.

FIG. 2. (A) Schematic illustration of the cDNA structures of ABCC11, ABCC12 and their splicing variants. Based on our cDNA sequence
data, exon structures were analyzed using the BLASTN program (http://www.ncbi.nlm.nih.gov/BLAST/) and human genome database. The
label 25A indicates the exon that is 14 bp shorter than exon 25. (B) Comparison of exon 21 and 22 structures among cDNAs of
ABCC12 and its splicing variants (upper column). Putative stop codons, i.e., TAG and TAA, in the cDNA of ABCC12 splicing variants are
indicated by an underline. The sequence difference between exons 25 and 25A of ABCC12 cDNA (lower column). The sequences of exon
and intron are written in capital and small letters, respectively.

FIG. 3. Schematic illustration of the putative protein topologies of ABCC11 and ABCC12 as well as their splicing variants. Transmembrane
domains were predicted using the SOSUI program (http://sosui.proteome.bio.tuat.ac.jp/sosuimenu0.html) and are numerically
indicated in the illustration.

Detection of ABCC11 and ABCC12 Transcripts in
Human Normal Tissues and Cancer Cell Lines

The transcripts of ABCC11 and ABCC12 genes were
widely detected by PCR in various adult human tissues,
including liver, lung, and kidney, as well as in
several fetal tissues (Fig. 4). In addition, the transcripts
of ABCC11 and ABCC12 genes were observed
in cell lines of carcinoma and adenocarcinoma originating
from breast, lung, colon and prostate. It should
be noted, however, that the PCR products
reflected the combined amounts of the transcripts of both
full-length forms and splicing variants.

To clarify this ambiguity, we therefore cloned 30
ABCC11 cDNAs from human adult liver. The splicing
variant A lacking exon 28 (Fig. 2A) was observed at
a frequency of about 25% in our cDNA clones of the
ABCC11 gene (data not shown). Likewise, we have
cloned a total of fifty cDNAs of the ABCC12 gene from
adult human liver, testis, and pancreas, as well as from
fetal liver and thymus. Interestingly, cDNAs of splicing
variants A, B, C, and D (Fig. 2) were predominant,
exceeding the full-length form. Indeed, the total of those
four splicing variants was more than 95% of the cloned
cDNAs (data not shown).

CONCLUDING REMARKS 

The present study provides evidence that ABCC11
and ABCC12 genes are transcribed in multiple splicing
variant forms. Detailed profiling and functional analysis
of these splicing variants need further study.
During this study, Tammur et al. reported the cloning
of ABCC11 and ABCC12 (31); however, splicing variants
of these genes were not addressed in their report.
On the other hand, recent studies have suggested a
relationship between paroxysmal kinesigenic
choreoathetosis and a certain gene(s) located in the
region of 16p11.2-q12.1 (32, 33).
Since ABCC11 and ABCC12 genes are encoded at
16q12.1, it is tempting to study the biological function
of ABCC11 and ABCC12 as well as to examine a
potential link between the genetic polymorphism of these
ABC transporters, including multiple splicing variants,
and the pathogenesis of paroxysmal kinesigenic
choreoathetosis.

FIG. 4. Detection of the transcripts of ABCC11 and ABCC12 genes in human normal tissues and cancer cells by PCR. M, marker lane
(100 bp DNA ladder). Normal tissues: Lanes 1, heart; 2, brain; 3, placenta; 4, lung; 5, liver; 6, skeletal muscle; 7, kidney; 8, pancreas; 9,
spleen; 10, thymus; 11, prostate; 12, testis; 13, ovary; 14, small intestine; 15, colon; 16, leukocyte; 17, fetal brain; 18, fetal lung; 19, fetal liver;
20, fetal kidney; 21, fetal heart; 22, fetal spleen; 23, fetal thymus; 24, fetal skeletal muscle. Human cancer cell lines: Lanes 25, breast
carcinoma (GI-101); 26, lung carcinoma (LX-1); 27, colon adenocarcinoma (CX-1); 28, lung carcinoma (GI-117); 29, prostatic adenocarcinoma
(PC3); 30, colon adenocarcinoma (GI-112); 31, ovarian carcinoma (GI-102); 32, pancreas adenocarcinoma (GI-103).

Impaired 2',3'-dideoxy-3'-thiacytidine accumulation in T-lymphoblastoid
cells as a mechanism of acquired resistance independent of multidrug
resistant protein 4 with a possible role for ATP-binding cassette C11

Cellular factors may contribute to the decreased efficacy of
chemotherapy in HIV infection. Indeed, prolonged treatment
with nucleoside analogues, such as azidothymidine (AZT),
2',3'-deoxycytidine or 9-(2-phosphonylmethoxyethyl)adenine, induces
cellular resistance. We have developed a human T lymphoblastoid
cell line (CEM 3TC) that is selectively resistant to the
anti-proliferative effect of 2',3'-dideoxy-3'-thiacytidine (3TC) because
the CEM 3TC cells were equally sensitive to AZT, as well as the
antimitotic agent, vinblastine. The anti-retroviral activity of 3TC
against HIV-1 was also severely impaired in the CEM 3TC cells.
Despite similar deoxycytidine kinase activity and unchanged
uptake of nucleosides such as AZT and 2'-deoxycytidine, CEM 3TC
had profoundly impaired 3TC accumulation. Further studies
indicated that CEM 3TC retained much less 3TC. However, despite
a small overexpression of multidrug resistance protein (MRP) 4,
additional studies with cells specifically engineered to overexpress
MRP4 demonstrated there was no impact on either 3TC
accumulation or efflux. Finally, an increased expression of the
MRP5 homologue, ATP-binding cassette C11 (ABCC11), was
observed in the CEM 3TC cells. We speculate that the decreased
3TC accumulation in the CEM 3TC might be due to the
upregulation of ABCC11.

Key words: ABC transporter, HIV, retrovirus, nucleoside analogues.



INTRODUCTION 

Long-term anti-retroviral therapy is the main strategy in clinical
treatment of HIV-1 infected patients. It is known that the best
results in the efficacy of HIV therapy are obtained when various
combinations of drugs are administered. Generally, combination
anti-retroviral therapies consist of one or more nucleoside reverse
transcriptase inhibitors (NRTIs), protease inhibitors, and/or
non-nucleoside reverse transcriptase inhibitors [1-3]. In fact, the
combination of two nucleoside-based reverse transcriptase
inhibitors and a protease inhibitor, referred to as 'highly active
anti-retroviral therapy' (HAART), dramatically suppresses
plasma HIV-RNA levels to <50 copies/ml [4-6]. Despite the
efficacy of such therapeutic regimens, long-term treatment with
HAART leads to the emergence of drug-resistant HIV strains,
and genetic mutations in the reverse transcriptase gene have been
isolated in many patients treated with HAART [7-9].

The presence of HIV mutations has been associated with
virological failure; however, individuals also display signs of
drug resistance in the absence of drug-resistant virus [10,11]. This
observation is consistent with the concept that 'cellular' factors
contribute to the failure of anti-retroviral therapy [12-15]. Indeed,
most anti-HIV agents, specifically dideoxynucleosides, are
phosphorylated by cellular kinases to compounds that inhibit
HIV replication. Consequently, decreasing the cellular levels of
these compounds could lead to an inability to suppress viral
replication and contribute to the failure of anti-retroviral therapy.
In this regard, it has been shown that long-term treatment of cell
lines with NRTIs [such as 3'-azido-3'-deoxythymidine (AZT)
and 2'-3'-dideoxycytidine (ddC)] results in diminished amounts
of the phosphorylated forms of NRTIs. In these cases, decreased
activity of the cellular kinases leads to antiviral resistance because
of an impaired ability to accumulate phosphorylated metabolites
[16-18].

Another cellular mechanism has previously been described to
explain decreased drug accumulation and resistance to retroviral
inhibitors: the increased efflux of phosphorylated drug [19].
Subsequently, we demonstrated that overexpression of a
functionally uncharacterized ATP-binding cassette (ABC)
drug-transporter [multidrug resistance protein (MRP) 4] was
genetically linked to the decreased drug accumulation and
resistance to some, but not all NRTIs [20] (for an overview of the
ABC-family members and nomenclature see
http://nutrigene.4t.com/humanabc.htm). The ABC transporters are mostly plasma
membrane localized and show ATP-dependent transport of a
broad range of compounds. Most MRP substrates are organic
anions and they are often conjugated to glutathione, glucuronide
or sulphate. Notably, two members of the MRP family (MRP4
and 5) efflux nucleotide analogues such as
9-(2-phosphonylmethoxyethyl)adenine (PMEA), azidothymidine
monophosphate, and thioguanine monophosphate
[21,22]. Further, the cells that overexpressed MRP4 had
decreased antiviral efficacy for 2',3'-dideoxy-3'-thiacytidine (3TC),
a finding strongly implicating MRP4 as a contributor to 3TC
cellular resistance. In order to evaluate whether prolonged
treatment with 3TC was able to induce cellular resistance by
this mechanism, we cultured a T-lymphoblastoid cell line in
the presence of increasing concentrations of this nucleoside
analogue. Our findings indicate that these cells acquire stable
resistance to 3TC by a mechanism whereby 3TC accumulation is
substantially decreased. Furthermore, the cells harbor no defect
in the enzyme activating 3TC to a nucleotide, nor is there a
general impairment in nucleoside uptake. However, despite a
small overexpression of MRP4 in these cells, it is clear that
another mechanism is responsible because MCF-7 cells
engineered to overexpress MRP4 do not show impaired 3TC
accumulation or increased 3TC efflux.

Abbreviations used: ABC, ATP-binding cassette; AZT, 3'-azido-3'-deoxythymidine; CNT, concentrative nucleoside carrier; dCK, deoxycytidine kinase;
dCyd, deoxycytidine; ddC, 2'-3'-dideoxycytidine; ENT, equilibrative nucleoside carrier; GFP, green fluorescence protein; HAART, highly active
anti-retroviral therapy; [3H]Cyd, [5-3H]cytidine; ID50, 50% inhibitory dose; MDR, multidrug resistance; [3H]AZT, [Me-3H]AZT; MRP, multidrug resistant
protein; MTT, 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-2H-tetrazolium bromide; NRTI, nucleoside reverse transcriptase inhibitor; Pgp, P-glycoprotein;
PMEA, 9-(2-phosphonylmethoxyethyl)adenine; RT, reverse transcriptase; 3TC, 2'-3'-dideoxy-3'-thiacytidine (also called lamivudine); [3H]3TC,
[Me-3H]3TC.

1 To whom correspondence should be addressed (e-mail John.schuetz@stjude.org).

© 2002 Biochemical Society

326 O. Turriziani and others

MATERIALS AND METHODS 
Chemicals 

The 3TC, kindly provided by Glaxo Wellcome (Stevenage, Herts.,
U.K.), was dissolved in PBS and kept at -20 °C. AZT,
[Me-3H]AZT ([3H]AZT, 3 Ci/mmol), ddC and [5-3H]cytidine
([3H]Cyd, 17.4 Ci/mmol) were purchased from Sigma Chemical
Co. (Milan, Italy). [Me-3H]3TC ([3H]3TC, 17.5 Ci/mmol) was
purchased from Moravek Biochemicals (Brea, CA, U.S.A.).
Commercial reagents and solvents were of analytical grade,
unless otherwise stated. [3H]2'-deoxycytidine ([3H]dCyd,
18-30 Ci/mmol) was from Amersham (Milan, Italy).

Selection of 3TC-resistant cell lines 

3TC-resistant cells were obtained by exposure of CEM cells, the
parental cell line, to increasing concentrations of 3TC. CEM cells
were initially propagated in the presence of 10 µM 3TC. Doubling
concentrations of 3TC were added to the culture medium and the
cells were allowed to grow until they reached a cell density of 10^6
cells/ml. After approx. 4 months, a stably resistant 3TC CEM
line grew in the presence of 1 mM 3TC, with a doubling time
similar to non-drug selected CEM cells. These cells were called
CEM 3TC.
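The stepwise selection described above can be pictured as a series of doubling concentrations between the starting and final values. The Python sketch below is an illustration only: the text gives the starting concentration (read here as 10 µM) and the final one (1 mM) but not the intermediate steps, so the function name and the choice to cap the last step at the target are assumptions.

```python
# Minimal sketch of a dose-escalation schedule by doubling (assumed, not from the source).
def doubling_schedule(start_um: float, target_um: float) -> list:
    """Concentrations from start to target, doubling each step and capped at the target."""
    if start_um <= 0 or target_um < start_um:
        raise ValueError("need 0 < start <= target")
    steps = []
    conc = start_um
    while conc < target_um:
        steps.append(conc)
        conc *= 2
    steps.append(target_um)  # final step is capped at the target concentration
    return steps


schedule = doubling_schedule(10, 1000)  # 10 µM up to 1 mM (1000 µM)
print(schedule)
print(len(schedule) - 1, "escalation steps")
```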

Assay to determine the anti-growth activity of drugs in CEM and 
CEM 3TC 

The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-2H-tetrazolium
bromide (MTT) assay was used to evaluate the anti-growth
activity of drugs in CEM and CEM 3TC [23]. Briefly, CEM and
CEM 3TC were seeded in 96-well microtitre plates at a
concentration of 50000 cells/well. Different concentrations of drugs
were added to triplicate cultures. Four days later, 20 µl of MTT
solution was added to each well and the cultures were incubated
at 37 °C. The viability of cells was examined
spectrophotometrically and the values were used to calculate the 50%
toxic concentration (TC50) of the various test compounds.

Assay of drug sensitivity 

CEM and CEM 3TC cells (3 × 10^5) were incubated with the HIV-PNL43
strain at a multiplicity of infection ('MOI') of 1 TCID50
(50% tissue culture infectious dose)/cell. After 1 h at 0 °C, the
cultures were washed three times with medium, resuspended in
medium containing 3TC (or AZT or ddC) at the appropriate
concentrations, and incubated at 37 °C. After 5 days the amount
of viral antigens produced by infected cells was determined by
ELISA (Abbott Laboratories, Abbott Park, IL, U.S.A.). The
values for the 50% inhibitory dose (ID50) were calculated from
plots of the percentage reduction of viral antigens.
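The text says only that ID50 values were read from plots of percentage reduction; the exact procedure is not given. One common way to do this numerically is to interpolate on a log-dose scale between the two measured points that bracket 50% inhibition. The Python sketch below shows that approach under this assumption; the function name and the example data are hypothetical.

```python
# Minimal sketch: ID50 by log-linear interpolation (assumed method, invented data).
import math


def interpolate_id50(doses, inhibition_pct, level: float = 50.0) -> float:
    """Dose giving `level` % inhibition, interpolated linearly in log10(dose)."""
    if len(doses) != len(inhibition_pct) or len(doses) < 2:
        raise ValueError("need at least two matching dose/response points")
    if any(d <= 0 for d in doses):
        raise ValueError("doses must be positive for a log scale")
    pairs = sorted(zip(doses, inhibition_pct))
    for (d_lo, y_lo), (d_hi, y_hi) in zip(pairs, pairs[1:]):
        if y_lo <= level <= y_hi and y_hi > y_lo:
            frac = (level - y_lo) / (y_hi - y_lo)
            log_dose = math.log10(d_lo) + frac * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_dose
    raise ValueError("response never crosses the requested level")


# Hypothetical dose-response data (nM, % reduction of viral antigen).
example_doses = [1, 10, 100]
example_inhibition = [10, 40, 80]
print(interpolate_id50(example_doses, example_inhibition))
```

A fitted sigmoidal model would be an alternative when more dose points are available; the interpolation above needs only two points that bracket the 50% level.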

Determination of intracellular accumulation of 3TC or other 
nucleosides 

CEM and CEM 3TC cells were treated with [3H]3TC (0.1 µM;
2 µCi/ml) for different times (as indicated in the Figures) and then
rapidly washed with ice-cold buffer and, after lysis, the amount
of radioactivity was determined. The intracellular uptake of 3TC
was also determined using a range of drug concentrations (from
0.02 µM to 0.2 µM). To study the intracellular accumulation of
AZT and dCyd, CEM and CEM 3TC cells were exposed to
[3H]AZT (0.6 µM; 2 µCi/ml) and [3H]dCyd (0.1 µM; 2 µCi/ml).
At the indicated time, cells were washed, lysed and the
radioactivity was determined.

3TC retention 

CEM and CEM 3TC were preincubated with [3H]3TC (2 µCi/ml)
for 2 h, washed with ice-cold PBS by centrifugation, resuspended
in drug-free medium, and maintained at 37 °C. After 1 h, the
intracellular radioactivity and radiolabelled drug released into
the medium were assessed by scintillation counting.

Reverse transcription (RT) and PCR

RNA from 5 × 10^6 CEM or CEM 3TC cells was isolated using Trizol
reagent (Gibco BRL, NY, U.S.A.). The RT-PCR analysis of the
RNA sample was performed as follows. RNA (10 µg) was
incubated with 2 µl of random primers (150 µg/ml) at 72 °C for
10 min, then combined with a mixture containing 4 µl of 5×
reaction buffer [250 mM Tris (pH 8.3), 375 mM KCl, 15 mM
MgCl2; Roche Molecular Biochemicals, Milan, Italy], 25 units of
human placental ribonuclease inhibitor, 1 µl of 10 mM dNTP
(Roche Molecular Biochemicals) and 8 units of Moloney murine
leukaemia virus RT (Roche Molecular Biochemicals). After
90 min at 42 °C, 5 µl of cDNA was subjected to PCR-mediated
amplification for P-glycoprotein (Pgp) according to
conditions described previously [24].

Cell extracts 

CEM and CEM 3TC cell pellets, prepared as described above, were 
resuspended in 5 vols of 20 mM Bis-Tris, pH 6.5, containing 
1 mM dithiothreitol (DTT) and 0.5 mM PMSF, and sonicated 
for 5 s on ice at 50 W. Sonication was repeated five times at 
intervals of 10 s. Cell extracts were centrifuged at 4 °C at 5000 g 
in a benchtop centrifuge for 15 min. Supernatants were collected 
and assayed for protein concentration using the spectro- 
photometric-based Bio-Rad Protein Assay. 

2'-Deoxycytidine kinase assay

Deoxycytidine kinase (dCK) activity present in cell extracts was
assayed with a radiochemical method which measures the
formation of [3H]dCMP from [3H]dCyd. The cell extracts were
incubated at 37 °C in 25 µl of a mixture containing 30 mM
Hepes-K+ (pH 7.5), 5 mM MgCl2, 5 mM ATP, 0.5 mM DTT
and 2.4 µM [3H]dCyd (2200 c.p.m./pmole) or [3H]3TC
(1500 c.p.m./pmole). The reaction, after 20 min incubation, was
terminated by spotting 20 µl of the incubation mixture onto a
25 mm DEAE paper disk (DE-81 paper; Whatman Biosystems
Ltd., Maidstone, Kent, U.K.). The disks were washed three times
in an excess of 1 mM ammonium formate, pH 3.6, in order to
remove unconverted nucleoside, followed by a final wash in
ethanol. The filters were dried and radioactive dCMP was
estimated by scintillation counting in 1 ml of Betamax
scintillating fluid (ICN Pharmaceuticals, Milan, Italy). One unit is
defined as the amount of enzyme catalysing the formation of
1 nmol of dCMP in 1 h at 37 °C.
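The unit definition above can be turned into a short calculation from raw counts. The Python sketch below is an illustration under stated assumptions: it assumes product formation is linear over the incubation, that the counts on the disk come from the 20 µl aliquot of a 25 µl reaction, and that no background subtraction is needed. None of these details is confirmed by the text, and the function name and example numbers are hypothetical.

```python
# Minimal sketch: specific dCK activity (units/mg) from disk counts; assumptions as noted.
def dck_units_per_mg(
    disk_cpm: float,
    cpm_per_pmol: float,
    incubation_min: float,
    protein_mg: float,
    spotted_ul: float = 20.0,
    reaction_ul: float = 25.0,
) -> float:
    """One unit = 1 nmol of product formed in 1 h (definition from the text)."""
    if min(cpm_per_pmol, incubation_min, protein_mg, spotted_ul, reaction_ul) <= 0:
        raise ValueError("specific activity, time, protein and volumes must be positive")
    pmol_on_disk = disk_cpm / cpm_per_pmol
    pmol_in_reaction = pmol_on_disk * reaction_ul / spotted_ul  # scale aliquot to whole reaction
    nmol_per_hour = (pmol_in_reaction / 1000.0) * (60.0 / incubation_min)
    return nmol_per_hour / protein_mg


# Hypothetical example: 3520 c.p.m. on the disk with [3H]dCyd at 2200 c.p.m./pmol,
# 20 min incubation, 5 µg (0.005 mg) of extract protein.
print(dck_units_per_mg(3520, 2200, 20, 0.005))
```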

Immunoblot analysis 

Crude membranes were prepared from cells as described previously
[20]. Proteins were estimated using the Bio-Rad Protein
Assay (Bio-Rad, Milan, Italy) and BSA was used as the standard.
The crude membrane proteins (200 µg) were resuspended in
standard Laemmli sample preparation buffer, loaded onto a
7.5% denaturing polyacrylamide gel and transferred to
nitrocellulose filters. The filters were blocked in 1× PBS containing
0.1% Tween 20 and 10% non-fat dry milk, immunoreacted with
polyclonal rabbit anti-(MRP4) IgG followed by peroxidase-conjugated
anti-(rabbit IgG), and then developed with the
Amersham ECL® detection system (Amersham, Arlington
Heights, IL, U.S.A.). The immunoblots were stripped with glycine
and reprobed with a monoclonal antibody to MRP1 (MRPr1;
Signet Laboratories, MA, U.S.A.).

Pgp detection 

Pgp was also detected by FACS analysis. CEM, CEM 3TC and 
CEMVBL100 (a T-cell line expressing a high level of Pgp) cells 
were incubated with a Pgp-specific monoclonal antibody that 
recognizes an external epitope of Pgp (mMRK16, Alexis Italia, 
Florence, Italy). After incubation (30 min at 18-25 °C) the cells 
were washed with PBS and incubated with FITC-labelled goat 
anti-mouse immunoglobulin (Bioline Diagnostics, Turin, Italy) 
for an additional 30 min. After washing with PBS, the cells were 
resuspended in PBS and analysed by flow cytometry. This was 
performed using a FACScan (DAKO-Galaxy, Milan, Italy) flow 
cytometer. Forward and side light scatter were collected in linear 
mode and served to exclude unwanted events (i.e. debris, dead 
cells and aggregates). The fluorescence signal was collected in the 
log mode. 

Generation of MRP4 stable cell lines 

The human MRP4 cDNA was cloned into the MSCV-IRES-GFP
[25] vector (kindly provided by Dr Robert Hawley, Holland
Laboratory, American Red Cross, Rockville, MD, U.S.A.) using
the EcoRI site. 293T cells were cotransfected with 10 µg each of
MSCV-MRP4-IRES-GFP, the helper plasmid pSRα-G, and
pEQPAM3-e (kindly supplied by P. Kelly and E. F. Vanin,
Department of Hematology/Oncology, St Jude Children's
Research Hospital, Memphis, TN, U.S.A.) by standard calcium
phosphate precipitation [26]. The supernatant was collected 48 h
after transfection, filtered, titred and frozen at -80 °C. To confirm
transfection, the 293T cells were analysed for green fluorescence
protein (GFP) expression. Subsequently, the cells were
transduced with MRP4. Briefly, the cells were plated at 5 × 10^4
cells/60 mm tissue culture dish and then the medium was replaced
by the retroviral supernatant supplemented with 6 µg/ml
polybrene and placed overnight in an incubator at 37 °C in a 5% CO2
humidified atmosphere. The transduction was repeated
twice more, for a total of three rounds. The transduced cells were
expanded and the GFP-positive cells were selected after FACS
[26]. Subsequently, a total lysate was prepared and loaded on a
denaturing polyacrylamide gel for MRP4 and MRP1 detection
by immunoblot [20].

Semi-quantitative RT-PCR analysis of MRP5 and ABCC11 

RNA was isolated from CEM and CEM 3TC cells using Trizol.
First-strand cDNA was made from 2.5 µg of RNA using the
cDNA synthesis kit for PCR (Boehringer Mannheim,
Indianapolis, IN, U.S.A.) in a final volume of 20 µl. MRP5 and
ABCC11 (also called MRP8) were amplified with 125 ng of
cDNA in a final volume of 50 µl containing 200 µM each of
dATP, dCTP, dGTP, and dTTP, and 300 nM each of forward
and reverse primers using the Expand High Fidelity PCR System
(Boehringer Mannheim). Samples were denatured for 5 min at
94 °C, followed by cycles of 94 °C for 30 s, 60 °C for 30 s, 68 °C
for 2 min, and a final incubation at 68 °C for 2 min. The number
of cycles for MRP5 was 26 and the number of cycles for ABCC11
was 30. Primers were as follows: MRP5, forward primer,
5'-TCCTGCCTTCTGTCCTGGTGT-3' and reverse primer,
5'-CTGTGCGACGACTGCGGTGAG-3'; ABCC11, forward
primer, 5'-AGAATGGCTGTGAAGGCTCAGC-3' and reverse
primer, 5'-GTTCCTCTCCAGCTCCAGTGC-3'. The predicted
sizes for the MRP5 and ABCC11 products were 390 bp and
550 bp respectively. Aliquots (10 µl) of the PCR reactions were
loaded onto a 1% agarose gel containing ethidium bromide.
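When setting up a reaction like the one above, the stated final concentrations are usually converted into absolute amounts per tube. The Python sketch below shows the unit arithmetic only; it assumes the 50 µl reaction volume given for the PCR, and the function name is hypothetical.

```python
# Minimal sketch: amount of a reagent in one reaction from its final concentration.
def picomoles(conc_nm: float, volume_ul: float) -> float:
    """nM x µl gives femtomoles; divide by 1000 to express the amount in picomoles."""
    if conc_nm < 0 or volume_ul < 0:
        raise ValueError("concentration and volume must be non-negative")
    return conc_nm * volume_ul / 1000.0


reaction_ul = 50.0  # assumed final PCR volume

primer_pmol = picomoles(300, reaction_ul)       # 300 nM of each primer
dntp_pmol = picomoles(200_000, reaction_ul)     # 200 µM of each dNTP, expressed in nM

print(primer_pmol, "pmol of each primer")
print(dntp_pmol / 1000.0, "nmol of each dNTP")
```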

RESULTS 

Selection of 3TC-resistant CEM cells 

Cellular factors, such as altered drug activation and/or decreased 
accumulation may cause failure of anti-retroviral drugs [13-15]. 
To determine if these cellular factors account for the variable 
response to the anti-retroviral drug 3TC, we cultured the CEM 
T-cell line in increasing concentrations of 3TC. After approx. 4 
months of culture in the presence of 1 mM 3TC, CEM cells were 
obtained that were refractory to the growth inhibitory properties 
of 3TC. These cells are referred to as CEM 3TC and their resistance 
to 3TC was stable for 4 months in the absence of 3TC. 
Furthermore, the resistance to the cytotoxic effects of 3TC was 
selective because CEM 3TC were equally sensitive to AZT and 
vinblastine (Table 1). 

CEM 3TC are impaired for antiviral efficacy 

To evaluate whether the CEM 3TC cells had an impaired ability to
inhibit HIV replication, CEM and CEM 3TC cells were infected
with HIV (see Materials and methods section) and then treated
with various concentrations of 3TC (Figure 1). The viral yield
was determined 5 days later by measuring the viral antigen
released into the culture supernatant. We found that CEM 3TC
were markedly resistant to the antiviral activity of 3TC, with the
3TC ID50 value for HIV-1 being approx. 10-fold higher in
CEM 3TC (ID50 = 60 nM) compared with the CEM cell line (ID50
= 6.0 nM) (Figure 1). Again, the resistance to the antiviral
activity of 3TC was selective because CEM 3TC were equally
sensitive to the antiviral activity of ddC and AZT (Table 1).

Table 1 Sensitivity of CEM and CEM 3TC to antiviral and antigrowth
activities of 3TC, AZT, ddC and vinblastine

TC50 is the concentration producing 50% cytotoxicity and ID50 is the dose producing 50%
inhibition of HIV replication.

           TC50                                       ID50
Cells      3TC (mM)    AZT (mM)      VBL (ng/ml)     3TC (nM)    AZT (nM)   ddC (nM)
CEM        4.00 ± 2    0.30 ± 0.10   0.08 ± 0.06     6 ± 3       7 ± 4      8 ± 3
CEM 3TC    > 10†       0.40 ± 0.15   0.10 ± 0.09     60 ± 10*    9 ± 5      10 ± 4

† Owing to 3TC insolubility, concentrations greater than 10 mM were not evaluated.
* CEM versus CEM 3TC, P < 0.05.

Figure 1 Antiviral activity is reduced in 3TC resistant cells

Antiviral activity of 3TC in CEM (diamonds) and CEM 3TC (squares). Cells were infected with HIV-PNL43
and cultured in the presence of different concentrations of 3TC. After 5 days the amount
of viral antigen produced by infected cells was measured as described in the Materials and
methods section. Each point represents the mean and the bars ± one standard deviation from
the mean.
Deoxycytidine kinase activity is not decreased in CEM 3TC 

After cellular uptake by nucleoside uptake carriers, 3TC is
phosphorylated by deoxycytidine kinase [27]. Deoxycytidine
kinase effectively phosphorylates both enantiomers of dCyd [28]
and dCyd analogues, such as 3TC [29]. Therefore, we evaluated
the enzymic activity of dCK in CEM and CEM 3TC cells using
both 3TC and dCyd as the substrate. The results indicated that
the enzymic activity of dCK from CEM 3TC was not decreased
compared with the parental CEM cells. In fact, using 3TC as
substrate, the dCK activity was 0.12 ± 0.01 units/mg of protein
in CEM and 0.16 ± 0.02 units/mg in CEM 3TC. Similarly, when
dCyd was used as substrate the dCK activity was 0.8 ± 0.08 units/mg
in CEM and 1.24 ± 0.12 units/mg in CEM 3TC. These results
indicate, using either dCyd or 3TC, that the 3TC-resistance of
CEM 3TC cannot simply be ascribed to a reduction in dCK
activity.

Decreased 3TC accumulation in CEM 3TC without a general 
decrease in nucleoside uptake 

Resistance to 3TC could be due to the reduced intracellular 
accumulation of drug, secondary to either transport changes or 
alterations in enzymic activation. Uptake of radiolabelled 3TC 
was used to assess variations in 3TC transport. Figure 2(A) 
shows there was no significant difference between the two cell 
lines in their initial uptake of 3TC (< 50 min). However, when 
the cells were incubated for longer intervals (> 1 h) dramatic 
differences in accumulation emerged. The CEM cells continued 
to accumulate radiolabelled 3TC, whereas at 8 h, the CEM 3TC 
achieved a steady-state level of drug that was as much as 3-fold

Figure 2 3TC accumulation is impaired in 3TC resistant cells

(A) Intracellular uptake of [3H]3TC in CEM (open bars) and CEM 3TC (closed bars). Cell cultures
were incubated with 0.1 µM [3H]3TC, and at the indicated times cells were extensively washed
with ice-cold PBS, lysed and the radioactivity determined by scintillation counting. The results
are the mean (± SD) from three independent experiments (*P < 0.05). (B) Long-term 3TC
accumulation. Either CEM (open bars) or CEM 3TC (closed bars) were incubated with 0.1 µM
[3H]3TC for the indicated intervals (*P < 0.05). (C) Uptake of different concentrations
of [3H]3TC. CEM (diamonds) and CEM 3TC (squares) were incubated with different concentrations
of [3H]3TC. After 8 h, intracellular radioactivity was evaluated by scintillation counting.
The points represent the mean of two independent experiments performed in triplicate, with the
error bars indicating ± one standard deviation (*P < 0.05).



lower than the maximum attained in the CEM cells. It is 
interesting to note that despite the continued presence of extra- 
cellular drug, the 3TC accumulation decreased in the CEM cells 
after 72 h of 3TC incubation. This suggests that the transporter 
effluxing 3TC is induced, a phenomenon previously reported for 
AZT [30] (Figure 2B). Finally, the CEM 3TC cells accumulated 
much less drug than the CEM cells at multiple concentrations of 
3TC (Figure 2C). It is notable that the 5-fold lower 3TC 
accumulation roughly corresponds with the greater 3TC con- 
centration required to inhibit HIV replication (Figure 1 and 
Table 1) and supports the idea that impaired 3TC accumulation 
is responsible for the enhanced survival of these cells in 3TC, as 
well as the requirement for more 3TC to inhibit HIV replication.

Figure 3 3TC resistant cells have a selective defect in 3TC accumulation

Intracellular accumulation of 3TC, AZT and dCyd in CEM (open bars) and CEM 3TC (solid bars).
Cells were incubated with (A) 0.1 µM [3H]3TC, (B) 0.6 µM [3H]AZT or (C) 0.1 µM [3H]dCyd.
At the indicated times, cells were washed with ice-cold PBS, lysed and the radioactivity
determined by scintillation counting. The results are the average of two independent
experiments done in triplicate.



Table 2 3TC drug accumulation and retention in 3TC resistant cells 

The values are the means of two independent experiments, each done in duplicate. For both 
intracellular radioactivity and radioactivity released into the supernatant, values were significantly 
different, P < 0.05 in each case. 



           Percentage of total [3H]3TC
Cells      Intracellular   Supernatant

CEM        82.4 ± 1.8      17.6 ± 1.8
CEM 3TC    65.7 ± 0.8      34.3 ± 0.8



To determine if the impaired accumulation was specific
for 3TC, and to rule out a general defect in nucleoside uptake
carriers, accumulation of [3H]3TC, [3H]AZT and [3H]dCyd was
determined (Figure 3). It is known that AZT is a substrate for
both the concentrative nucleoside carrier (CNT) and equilibrative
nucleoside carrier (ENT2) [31], while deoxycytidine is a known
substrate for CNT [32]. The intracellular radioactivity was
then measured as described in the Materials and methods section.
The studies reveal that CEM 3TC cells are not impaired for the
accumulation of [3H]AZT and [3H]dCyd, which indicates that
the nucleoside-uptake carriers transporting these nucleosides
are not impaired in the CEM 3TC cells.



CEM 3TC have decreased retention of 3TC, with no change in Pgp
and a small increase in MRP4 expression

To explore whether the defect in cellular 3TC accumulation
could be associated with a decreased capability of the resistant
cell line to retain 3TC, we evaluated 3TC retention. The cells
were pre-loaded with [3H]3TC followed by resuspension in
drug-free media. Subsequently, the amount of radioactivity in the
cells and media was determined. The results, shown in Table 2,
indicate that the CEM 3TC cells retained much less intracellular
radioactivity than the CEM cells. Furthermore, a correspondingly
higher percentage of radioactivity was released into
the medium from CEM 3TC compared with CEM cells. This
indicates that CEM 3TC have a decreased ability to retain 3TC,
and this correlates with the selectively impaired accumulation of
3TC (Figure 2).

Next, we evaluated whether the decreased 3TC accumulation
could be due to an increased expression of Pgp using FACS
analysis with an antibody that detects a surface Pgp epitope (see
Materials and methods section). We found that both CEM and
CEM 3TC have undetectable Pgp, unlike the positive control,
CEMVBL 100, that expresses high amounts of Pgp (results not
shown). Furthermore, we demonstrated that neither the CEM
nor CEM 3TC cells had detectable levels of MDR1 transcript when
amplified to the same extent as the MDR1-positive cell line,
CEMVBL 100.

Expression of MRP4 in CEM 3TC and transport of 3TC in cells
ectopically expressing MRP4

Recent studies indicated that the ABC transporter, MRP4, plays
a role in the cellular resistance to anti-retroviral nucleoside
drugs, including 3TC [20]. To evaluate MRP4 expression, we
performed immunoblot analysis on crude membranes from the
CEM 3TC cells (Figure 4). We found that the level of
immunoreactive MRP4 increased approximately 2-fold in the CEM 3TC
cells. In contrast, MRP1 was not different in the two cell lines. It
is interesting to note that MRP4, which is only a
1325-amino-acid-residue protein, runs at an estimated size of 220 kDa,
whereas MRP1, a 1531-amino-acid-residue protein, runs at an
estimated size of 190 kDa. This is probably due to the fact that
MRP4 is extensively glycosylated with at least seven predicted
N-linked asparagine glycosylation sites [25].

To determine whether MRP4 played a role in transport of
3TC, we developed cell lines that ectopically overexpressed
MRP4 (Figure 4B). We confirmed the phenotype of these cells by
evaluating the uptake of PMEA, a known MRP4 substrate [20]
(Figure 4C). These cells were then assessed for the uptake of
3TC (Figure 4D). We evaluated 3TC uptake after a 24 h
incubation in concentrations of 3TC from 0.5 to 10 µM. The total
accumulation of 3TC radioactivity was the same in the MRP4
cells as in the vector-only transfected cells (Figure 4D). Since
longer incubations (48 h) produced slightly lower 3TC
accumulation, we assessed whether efflux was faster in the MRP4 cells.
The cells were loaded with 3TC, resuspended in drug-free media,
and then assessed for both 3TC intracellular-associated
radioactivity and the radioactivity released into the media (Figure 4E).
For both cell lines, the time to decrease the intracellular
radioactivity to one-half the initial level was approx. 20 min
and, notably, a corresponding efflux of radioactivity into the
media occurred. These studies directly demonstrate in MCF-7
cells overexpressing MRP4 that 3TC efflux is not enhanced by
MRP4 overexpression.



Expression of MRP4, MRP5 and ABCC11 in CEM 3TC 

The efflux of nucleotide analogues in mammalian cells has been
confirmed for MRP4 and MRP5 [21]. Although we have demonstrated
that cells specifically overexpressing MRP4 do not have

Figure 4 Analysis of MRP1 and MRP4 in CEM 3TC and the impact of MRP4 upon 3TC transport

(A) Lysates of CEM and CEM 3TC cells were analysed on an immunoblot with antiserum against MRP4, and then with antiserum against MRP1. (B) Immunoblot analysis of MRP4 expression in
cells engineered to overexpress MRP4. (C) Functional analysis of MRP4 using PMEA accumulation, as described in the Materials and methods section. (D) MCF-7 cells ectopically expressing
MRP4 (●) or the control vector (○) were incubated with [3H]3TC (0.5-10 µM). After 24 h, the accumulation of 3TC radioactivity was determined. (E) 3TC efflux in cells ectopically expressing
MRP4. Control vector (○) and MRP4-expressing cells (●) were loaded with [3H]3TC followed by resuspension in drug-free media. Both intracellular-associated 3TC radioactivity (Cell Associated)
and 3TC radioactivity released into the media (Supernatant) were assessed.




[Gel image: RT-PCR products labelled MRP5 (392 bp) and MRP8 (550 bp)]

Figure 5 Analysis of MRP5 and ABCC11 (MRP8) expression in CEM 3TC cells

Total RNA was isolated from both CEM and CEM 3TC cells, followed by RT-PCR. The lower band 
in MRP8 (ABCC11) was sequenced and found to be a non-specific band. The primers and 
conditions are described in the Materials and methods section. 



decreased accumulation or increased efflux of 3TC, it remains 
possible that another ABC transporter effluxes 3TC metabolites. 
Our recent investigations and other studies [33,34] indicate that
MRP5 has two closely related homologues on chromosome 16.
We evaluated ABCC11 mRNA levels in CEM
and CEM 3TC cells by RT-PCR (ABCC12 was not detected). In
addition, we assessed the level of MRP5 mRNA (Figure 5). We
found that the level of MRP5 was unchanged in the CEM 3TC
cells. In contrast, semi-quantitative RT-PCR revealed that
ABCC11 was increased 6-fold. The magnitude of this increase
in ABCC11 mRNA is comparable with the impairment in 3TC
antiviral efficacy in these cells. The lack of a direct correspondence 
may be due to the possibility that the protein is expressed at a 
much higher level than the mRNA; however, at this time, it is 
impossible to determine whether ABCC11 protein levels are 
increased due to the unavailability of a specific antibody. 

DISCUSSION 

Recent findings have recognized that anti-retroviral drug treatment
causes a phenotype described as cellular resistance [12-15].
This phenomenon is consistent with the knowledge that different
cell lines require a broad range in the concentration of
anti-retroviral drug to inhibit HIV replication [35]. Two main
mechanisms contribute to cellular resistance: altered metabolism
of nucleoside analogues due to impaired nucleoside phosphorylation
and increased efflux of the compounds by membrane
transport mechanisms [20,21].

Our results demonstrate that prolonged treatment with 3TC 
selects for cells with an acquired, stable resistance to 3TC.
Compared with the CEM cells, CEM 3TC required about 10-fold 
more 3TC to inhibit HIV. Moreover, these cells showed increased 
resistance to the cytotoxic effects of 3TC. However, the CEM 3TC 
were as sensitive to AZT, ddC and vinblastine as the CEM cells, 
demonstrating that this resistance is specific for 3TC. Notably, 
3TC resistance was not due to decreased dCK activity, the 
principal enzyme required for activation of 3TC [27]. Furthermore,
uptake of the natural nucleoside dCyd and of the nucleoside
analogue AZT was unaltered in CEM 3TC. Thus, these findings rule
out the possibility of a general defect in nucleoside uptake
because such alterations would undoubtedly have impacted
upon AZT and dCyd accumulation, considering that AZT is
transported by both ENT2 and CNT, and that dCyd is trans- 
ported by CNT [32,36]. In contrast, 3TC accumulation was 
substantially reduced in CEM 3TC cells and was associated with 
decreased intracellular retention. Consequently, we postulated 
that an efflux transporter was responsible for preventing 3TC 
accumulation in the resistant cells. In fact, drug efflux pumps are 
an important part of the cellular defence against cytotoxic 
compounds. Specifically, cells overexpressing drug-transporting
proteins become resistant to a wide range of drugs with different 
structures and/or cellular targets. This phenomenon is known as 
multidrug resistance (MDR). The most well characterized of 
these drug transporters is Pgp [37]. The overexpression of Pgp 
has been described for many cancer cells with acquired resistance 
to chemotherapeutics [38]. Several studies have reported that
Pgp-expressing cells are also resistant to the anti-growth and 
antiviral activity of some NRTIs [39-41]. On the basis of these 
findings, we evaluated Pgp expression in CEM 3TC . However, as 
anticipated, based upon 3TC structure, the CEM 3TC cells had no 
detectable Pgp overexpression. 

Recently, it has been reported that one member of the MRP 
family, MRP4, is overexpressed in cells that acquire resistance to 
the cytotoxic effects of the modified nucleotide analogue, PMEA 
[20]. Notably, overexpression of MRP4 impairs the antiviral 
efficacy of PMEA and other nucleoside analogues, such as 3TC 
and AZT. In our 3TC resistant cells, we found a small increase
in MRP4 (< 2-fold), suggesting that 3TC metabolites could be 
MRP4 substrates. However, an analysis of MCF-7 cells 
ectopically expressing MRP4 showed that MRP4 does not affect 
either the accumulation or the efflux of 3TC. This result contrasts 
with the previously reported findings; however, it should be 
noted that only PMEA and AZT-monophosphate were effluxed 
to a greater extent in the MRP4 overexpressing cells, and it was 
not directly demonstrated that 3TC metabolites were more 
readily effluxed in those cells [20]. Thus, based on the current
studies, it seems unlikely that MRP4-mediated efflux is directly
involved in cellular 3TC resistance, and more likely that the impaired
3TC accumulation and decreased retention are due to an additional
3TC transporter in the CEM 3TC cells. Since MRP5 has been
demonstrated to transport similar substrates as MRP4 [21], we 
evaluated its mRNA expression, but found no difference in 
MRP5 expression in the CEM 3TC cells. However, recent studies 
[33] have determined that MRP4 and MRP5 homologues are
found on chromosome 16q12. These homologous genes also lack
an N-terminal domain that is found in the prototypical ABCC1
(i.e. MRP1). In the CEM 3TC cells, we found increases in ABCC11
mRNA expression (6-fold). However, in the absence of an
antibody we are unable, at this time, to confirm if ABCC11 protein
is overexpressed. Nevertheless, it is possible that this transporter 
contributes to the efflux-mediated resistance to 3TC. 



In conclusion, our results are most consistent with the concept
that 3TC resistance is mediated by an inability to adequately
accumulate 3TC. This is not due to impaired 3TC
phosphorylation or initial uptake. It is possible that increased
ABCC11 expression decreases 3TC accumulation and increases
cellular 3TC resistance. However, at the present time, we cannot
directly confirm this possibility. The current studies support the
idea that 3TC resistance may be due to ABCC11 overexpression.
However, we cannot exclude the possibility that a combination
of increased MRP4 and ABCC11 underlies the 3TC resistance and
impaired accumulation in these cells. This might be analogous
to the overexpression and potential role of MRP1, MRP2 and
ABCG2 in cellular resistance to the camptothecin class of
cancer chemotherapeutic drugs [42]. Future studies will address
the possibility of such interactions between ABCC11 and
MRP4.

The Sequence of the Human Genome

THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of 
the human genome was generated by the whole-genome shotgun sequencing
method. The 14.8-billion bp DNA sequence was generated over 9 months from
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) 
from both ends of plasmid clones made from the DNA of five individuals. Two 
assembly strategies— a whole-genome assembly and a regional chromosome 
assembly— were used, each combining sequence data from Celera and the 
publicly funded genome effort. The public data were shredded into 550-bp 
segments to create a 2.9-fold coverage of those genome regions that had been 
sequenced, without including biases inherent in the cloning and assembly 
procedure used by the publicly funded group. This brought the effective cov- 
erage in the assemblies to eightfold, reducing the number and size of gaps in 
the final assembly over what would be obtained with 5.11-fold coverage. The 
two assembly strategies yielded very similar results that largely agree with 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in 
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in 
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 
26,588 protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are dispersed in low G+C sequence separated 
by large tracts of apparently noncoding sequence. Only 1.1% of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being 
intergenic DNA. Duplications of segmental blocks, ranging in size up to chro- 
mosomal lengths, are abundant throughout the genome and reveal a complex 
evolutionary history. Comparative genomic analysis indicates vertebrate ex- 
pansions of genes associated with neuronal function, with tissue-specific de- 
velopmental regulation, and with the hemostasis and immune systems. DNA 
sequence comparisons between the consensus sequence and publicly funded 
genome data provided locations of 2.1 million single-nucleotide polymorphisms
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 
1250 on average, but there was marked heterogeneity in the level of poly- 
morphism across the genome. Less than 1% of all SNPs resulted in variation in 
proteins, but the task of determining which SNPs have functional consequences 
remains an open challenge. 
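
The headline numbers in this summary can be cross-checked with simple arithmetic. The sketch below uses only figures quoted above; the variable names are illustrative, and the interpretation of coverage as total bases divided by genome length is an assumption about how the fold-coverage figure was derived.

```python
# Figures quoted in the summary above.
TOTAL_SEQUENCED_BP = 14.8e9       # bases of sequence generated
HIGH_QUALITY_READS = 27_271_853   # number of high-quality sequence reads
CONSENSUS_BP = 2.91e9             # length of the consensus sequence
REPORTED_COVERAGE = 5.11          # fold coverage stated in the text
PAIRWISE_DIFF_RATE = 1 / 1250     # differences per bp between two haploid genomes

# Average number of bases contributed by each read.
mean_read_length = TOTAL_SEQUENCED_BP / HIGH_QUALITY_READS

# Coverage if the consensus length is used as the genome size.
coverage_of_consensus = TOTAL_SEQUENCED_BP / CONSENSUS_BP

# Genome size implied by the stated 5.11-fold coverage.
implied_genome_size = TOTAL_SEQUENCED_BP / REPORTED_COVERAGE

# Expected number of differing positions between two haploid genomes
# over the length of the consensus.
expected_pairwise_differences = CONSENSUS_BP * PAIRWISE_DIFF_RATE

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"coverage of consensus: {coverage_of_consensus:.2f}-fold")
print(f"implied genome size: {implied_genome_size / 1e9:.2f} billion bp")
print(f"expected pairwise differences: {expected_pairwise_differences:,.0f}")
```

The mean read length comes out near 543 bp, and dividing by the 2.91-billion bp consensus gives roughly 5.09-fold, slightly under the stated 5.11-fold. The small gap may reflect rounding of the 14.8-billion figure or a different genome-size denominator; the text does not say which.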



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward
understanding human evolution, the causation
of disease, and the interplay between the 
environment and heredity in defining the hu- 
man condition. A project with the goal of 
determining the complete nucleotide se- 
quence of the human genome was first for- 
mally proposed in 1985 (1). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 
the sequenced segments. 

The modern history of DNA sequencing
began in 1977, when Sanger reported his method
for determining the order of nucleotides of
DNA using chain-terminating nucleotide analogs
(3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 
with this new technology (6). From early se- 
quencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be es- 
sential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the ex- 
pressed sequence tag (EST) method of gene 
identification (8), which is a random selection,
very high throughput sequencing approach to
characterize cDNA libraries. The EST method
led to the rapid discovery and mapping of human
genes (9). The increasing numbers of human
EST sequences necessitated the development
of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
bly and analysis of hundreds of thousands of 
ESTs. This algorithm permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 

The complete 49-kbp bacteriophage lambda
genome sequence was determined by a
shotgun restriction digest method in 1982
(11). When considering methods for sequencing
the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts established the
broad applicability of this approach (14, 15). 

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simultaneously
map and sequence the human genome
by means of end sequences from 150-kbp
bacterial artificial chromosomes (BACs)
(17, 18). The end sequences spanned by
known distances provide long-range continu- 
ity across the genome. A modification of the 
BAC end-sequencing (BES) method was ap- 
plied successfully to complete chromosome 2 
from the Arabidopsis thaliana genome (19). 
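
How a mate pair with a known clone length constrains the distance between two contigs can be illustrated with a small sketch. This is not the assembler used in the work described here: the `ReadPlacement` type, the `estimate_gap` function, and the coordinate convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPlacement:
    """Where one read of a mate pair aligns (hypothetical representation)."""
    contig: str
    start: int      # 0-based position of the read's leftmost base on the contig
    length: int     # read length in bp
    forward: bool   # True if the read aligns to the contig's forward strand


def estimate_gap(left: ReadPlacement, right: ReadPlacement,
                 left_contig_len: int, insert_size: int) -> int:
    """Estimate the gap between two adjacent contigs linked by one mate pair.

    Assumes the left read points forward, toward the end of its contig, and
    the right read lies on the reverse strand of the following contig. The
    clone then spans the tail of the left contig, the gap, and the head of
    the right contig up to the end of the right read.
    """
    if not left.forward or right.forward:
        raise ValueError("expected an inward-facing (forward, reverse) pair")
    covered_on_left = left_contig_len - left.start
    covered_on_right = right.start + right.length
    return insert_size - covered_on_left - covered_on_right


# Example: a 10-kbp clone whose reads land near the ends of two contigs.
left_read = ReadPlacement(contig="contig_A", start=9_000, length=550, forward=True)
right_read = ReadPlacement(contig="contig_B", start=450, length=550, forward=False)
gap = estimate_gap(left_read, right_read, left_contig_len=10_000, insert_size=10_000)
print(gap)  # 8000
```

A negative result would suggest that the two contigs overlap or that the pair is misplaced. In practice a single pair gives only a rough estimate, since clone lengths vary within a library.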

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been se- 
quenced, it was clear that the rate of progress 
in human genome sequencing worldwide 
was very slow (22), and the prospects for
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to under- 
take the sequencing of the human genome with 
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation
of a genome-sequencing facility were established
in the TIGR facility (24). However, the
facility envisioned for Celera would have a
capacity roughly 50 times that of TIGR, and
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was determined
over a 1-year period (26-28). The Drosophila
genome-sequencing effort resulted in
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic
changes in the public genome effort subsequent
to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach
to the human genome. We initially proposed
to do 10-fold sequence coverage of the
genome over a 3-year period and to make interim
assembled sequence data available quarterly.
The modifications included a plan to perform
random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies
published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly
announcements in the absence of interim
assemblies to report.

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes
of the Homo sapiens genome. Any GenBank-derived
data were shredded to remove
potential bias to the final sequence from chimeric
clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview
of the genome and the features encoded
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 
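
The shredding of GenBank-derived data mentioned above can be pictured as cutting each sequence into fixed-length, overlapping fragments. The sketch below is only an illustration: the 550-bp fragment length and 2.9-fold target come from the summary of this paper, but the evenly spaced tiling, the `shred` function, and its parameters are assumptions, not the procedure actually used.

```python
def shred(sequence: str, fragment_len: int = 550, coverage: float = 2.9) -> list[str]:
    """Cut a sequence into overlapping fixed-length fragments.

    The step between fragment starts is chosen so that each base is covered
    about `coverage` times on average (hypothetical tiling scheme).
    """
    if fragment_len <= 0 or coverage <= 0:
        raise ValueError("fragment_len and coverage must be positive")
    if len(sequence) <= fragment_len:
        return [sequence]
    step = max(1, round(fragment_len / coverage))
    last_start = len(sequence) - fragment_len
    starts = list(range(0, last_start + 1, step))
    # Add a final fragment so the end of the sequence is not left uncovered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + fragment_len] for s in starts]


# Toy input: a 10,000-bp placeholder sequence.
toy_sequence = "ACGT" * 2500
fragments = shred(toy_sequence)
achieved_coverage = sum(len(f) for f in fragments) / len(toy_sequence)
print(len(fragments), round(achieved_coverage, 2))
```

On this toy input the tiling yields about 2.8-fold coverage. The point of the technique, as the text describes it, is that the fragments carry the sequence itself but not the order and orientation decisions made in the original assembly.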

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section.

Various policies of the United States and the
World Medical Association, specifically the
Declaration of Helsinki, offer recommendations
for conducting experiments with human
subjects. We convened an Institutional Review
Board (IRB) (31) that helped us establish
the protocol for obtaining and using human
DNA and the informed consent process
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of
Confidentiality from the Department of Health
and Human Services. This Certificate authorized
Celera to protect the privacy of the
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds.
Prospective donors were asked, on a voluntary
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three fe- 
males — one African-American, one Asian- 
Chinese, one Hispanic-Mexican, and two 
Caucasians (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/1304/DC1).
The decision of whose DNA to
sequence was based on a complex mix of factors,
including the goal of achieving diversity as
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from each end of each plasmid insert.
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid librar-
ies in one or more of three size classes: 2 kbp, 10
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored ef- 
fectively (Fig. 2) (34). 

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence
per reaction. This limitation on read length has 
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is 
supported by a high-performance computation- 
al facility (36). 

The process for DNA sequencing was mod-
ular by design and automated. Intermodule
sample backlogs allowed four principal 
modules to operate independently: (i) li- 
brary transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence deter- 
mination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully
matched and sample backlogs are continu- 
ously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated 
capillary array sequencer and as such can 
be operated with a minimal amount of 
hands-on time, currently estimated at about 
15 min per day. The capillary system also 
facilitates correct associations of sequenc-
ing traces with samples through the elimi-
nation of manual sample loading and lane-
tracking errors associated with slab gels.
About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management 
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that per- 
formed raw material and in-process testing 
and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation, before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the
average trimmed sequence length was 543 
bp, and the sequencing accuracy was expo- 
nentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri-
al DNA. The entire read for any sequence
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 

The importance of the base-pair level ac- 
curacy of the sequence data increases as the 
size and repetitive nature of the genome to 
be sequenced increases. Each sequence 
read must be placed uniquely in the genome,
and even a modest error rate can reduce the
effectiveness of assembly. In addition,
maintaining the validity of mate-pair
information is absolutely critical for the
algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as
sequencing reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data
produced by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1. Celera-generated data input into assembly.

            Number of reads for different insert libraries
  Individual         2 kbp        10 kbp        50 kbp         Total  Total base pairs

No. of sequencing reads
  A                      0             0     2,767,357     2,767,357     1,502,674,851
  B             11,736,757     7,467,755        66,930    19,271,442    10,464,393,006
  C                853,819       881,290             0     1,735,109       942,164,187
  D                952,523     1,046,815             0     1,999,338     1,085,640,534
  F                      0     1,498,607             0     1,498,607       813,743,601
  Total         13,543,099    10,894,467     2,834,287    27,271,853    14,808,616,179

Fold sequence coverage (2.9-Gb genome)
  A                      0             0          0.52          0.52
  B                   2.20          1.40          0.01          3.61
  C                   0.16          0.17             0          0.32
  D                   0.18          0.20             0          0.37
  F                      0          0.28             0          0.28
  Total               2.54          2.04          0.53          5.11

Fold clone coverage
  A                      0             0         18.39         18.39
  B                   2.96         11.26          0.44         14.67
  C                   0.22          1.33             0          1.54
  D                   0.24          1.58             0          1.82
  F                      0          2.26             0          2.26
  Total               3.42         16.43         18.84         38.68

Insert size* (mean)
  Average         1,951 bp     10,800 bp     50,715 bp

Insert size* (SD)
  Average            6.10%         8.10%        14.90%

% Mates†
  Average            74.50         80.80         75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the
genome. One method involves the computational
combination of all sequence reads with
shredded data from GenBank to generate an
independent, nonbiased view of the genome. The
second approach involves clustering all of the
fragments to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to
computational assembly. Both approaches provided
essentially the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the
completeness and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the
genome was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating
procedures, with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
[...] described further in the text.

Shotgun sequence assembly is a classic
example of an inverse problem: given a set
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera as-
semblies consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector.

2.1 Assembly data sets

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2,
10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in
known sequenced stretches of the genome, we
were able to characterize the range of insert
sizes in each library and determine a mean and
standard deviation. Table 1 details the number
of reads, sequencing coverage, and clone
coverage achieved by the data set. The clone
coverage is the coverage of the genome in cloned
DNA, considering the entire insert of each
clone that has sequence from both ends. The
clone coverage provides a measure of the
amount of physical DNA coverage of the
genome. Assuming a genome size of 2.9 Gbp, the
Celera trimmed sequences gave a 5.1X
coverage of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.
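
The coverage figures quoted above can be checked with a short calculation. The sketch below is illustrative only: the input numbers are taken from Table 1, the function names are hypothetical, and the clone-coverage formula (mated reads divided by two, times the mean insert size) is an assumption that uses the averaged "% Mates" and insert-size values, so it only approximates the per-library figures reported in the table.

```python
# Illustrative check of the coverage figures, using values from Table 1.
GENOME_SIZE_BP = 2.9e9             # genome size assumed in the text
TOTAL_TRIMMED_BP = 14_808_616_179  # total base pairs of trimmed Celera reads

# library -> (number of reads, mean insert size in bp, fraction of reads with a mate)
LIBRARIES = {
    "2 kbp": (13_543_099, 1_951, 0.745),
    "10 kbp": (10_894_467, 10_800, 0.808),
    "50 kbp": (2_834_287, 50_715, 0.756),
}


def sequence_coverage(total_bp, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bp / genome_size


def clone_coverage(reads, insert_bp, mate_fraction, genome_size=GENOME_SIZE_BP):
    """Approximate fold clone coverage.

    Assumption: each pair of mated reads stands for one clone whose whole
    insert counts toward physical coverage of the genome.
    """
    clones = reads * mate_fraction / 2
    return clones * insert_bp / genome_size


if __name__ == "__main__":
    print(f"sequence coverage: {sequence_coverage(TOTAL_TRIMMED_BP):.2f}X")
    for name, (reads, insert_bp, mates) in LIBRARIES.items():
        print(f"{name} clone coverage: {clone_coverage(reads, insert_bp, mates):.2f}X")
```

Because the averaged inputs are rounded, the clone-coverage estimates land slightly below the table values; the sequence coverage reproduces the reported 5.11X.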

The second data set was from the publicly
funded Human Genome Project (PFP) and is
primarily derived from BAC clones (30). The
BAC data input to the assemblies came from a
download of GenBank on 1 September 2000
(Table 2) totaling 4443.3 Mbp of sequence.
The data for each BAC is deposited at one of
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered
assemblies of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has
focused on a product of lower quality and
completeness, but on a faster time-course, by
concentrating on the production of Phase 1 data
from a 3X to 4X light-shotgun of each BAC
clone.

We screened the bactig sequences for con- 
taminants by using the BLAST algorithm 
against three data sets: (i) vector sequences
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), fil-
tered at 200 bp at 98%; and (iii) the non-
redundant nucleotide sequences from Gen-
Bank without primate and human virus en-
tries, filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2),
and 5% single sequencing reads (Phase 0).
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and in- 
cluded in the data sets for both assembly 
processes (18). 
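
The tip-excision rule described above (25 bp or more of vector within 50 bp of a contig end) can be sketched as follows. This is a minimal illustration, not the screening code used for the assemblies: the function name, the half-open (start, end) hit coordinates, and the choice to discard the matching vector along with the tip are all assumptions.

```python
def trim_vector_tips(seq_len, vector_hits, min_hit=25, end_window=50):
    """Return the (start, end) interval of a contig kept after tip excision.

    vector_hits is a list of half-open (hit_start, hit_end) intervals where
    vector was found. A hit counts only if it is at least min_hit bp long and
    lies within end_window bp of one end of the contig. Assumption: the tip
    and the matching vector are both removed.
    """
    keep_start, keep_end = 0, seq_len
    for hit_start, hit_end in vector_hits:
        if hit_end - hit_start < min_hit:
            continue  # too short to count as a vector match
        if hit_start < end_window:
            # vector near the left end: drop everything up to the end of the hit
            keep_start = max(keep_start, hit_end)
        elif seq_len - hit_end < end_window:
            # vector near the right end: drop everything from the hit onward
            keep_end = min(keep_end, hit_start)
    return keep_start, max(keep_start, keep_end)


if __name__ == "__main__":
    print(trim_vector_tips(1000, [(10, 40)]))    # hit near the left end
    print(trim_vector_tips(1000, [(960, 990)]))  # hit near the right end
    print(trim_vector_tips(1000, [(500, 530)]))  # internal hit, no tip excised
```

Internal matches are left alone here because the text describes a separate 30-bp criterion for them without stating how they were handled.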

2.2 Assembly strategies

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the
overall process flow. 

Table 2. GenBank data input into assembly.

                                  Completion phase sequence
                                                0        1 and 2              3

Whitehead Institute/MIT Center for Genome Research, USA
  Number of accession records               2,825          6,533            363
  Number of contigs                       243,786        138,023            363
  Total base pairs                    194,490,158  1,083,848,245     48,829,358
  Total vector masked (bp)              1,553,597        875,618          2,202
  Total contaminant masked (bp)        13,654,482      4,417,055         98,028
  Average contig length (bp)                  798          7,853        134,516

Washington University, USA
  Number of accession records                  19          3,232          1,300
  Number of contigs                         2,127         61,812          1,300
  Total base pairs                      1,195,732    561,171,788    164,214,395
  Total vector masked (bp)                 21,604        270,942          8,287
  Total contaminant masked (bp)            22,469      1,476,141        469,487
  Average contig length (bp)                  562          9,079        126,319

Baylor College of Medicine, USA
  Number of accession records                   0          1,626            363
  Number of contigs                             0         44,861            363
  Total base pairs                              0    265,547,066     49,017,104
  Total vector masked (bp)                      0        218,769          4,960
  Total contaminant masked (bp)                 0      1,784,700        485,137
  Average contig length (bp)                    0          5,919        135,033

Production Sequencing Facility, DOE Joint Genome Institute, USA
  Number of accession records                 135          2,043            754
  Number of contigs                         7,052         34,938            754
  Total base pairs                      8,680,214    294,249,631     60,975,328
  Total vector masked (bp)                 22,644        162,651          7,274
  Total contaminant masked (bp)           665,818      4,642,372        118,387
  Average contig length (bp)                1,231          8,422         80,867

The Institute of Physical and Chemical Research (RIKEN), Japan
  Number of accession records                   0          1,149            300
  Number of contigs                             0         25,772            300
  Total base pairs                              0    182,812,275     20,093,926
  Total vector masked (bp)                      0        203,792          2,371
  Total contaminant masked (bp)                 0        308,426         27,781
  Average contig length (bp)                    0          7,093         66,978

Sanger Centre, UK
  Number of accession records                   0          4,538          2,599
  Number of contigs                             0         74,324          2,599
  Total base pairs                              0    689,059,692    246,118,000
  Total vector masked (bp)                      0        427,326         25,054
  Total contaminant masked (bp)                 0      2,066,305        374,561
  Average contig length (bp)                    0          9,271         94,697

Others*
  Number of accession records                  42          1,894          3,458
  Number of contigs                         5,978         29,898          3,458
  Total base pairs                      5,564,879    283,358,877    246,474,157
  Total vector masked (bp)                 57,448        279,477         32,136
  Total contaminant masked (bp)           575,366      1,616,665      1,791,849
  Average contig length (bp)                  931          9,478         71,277

All centers combined†
  Number of accession records               3,021         21,015          9,137
  Number of contigs                       258,943        409,628          9,137
  Total base pairs                    209,930,983  3,360,047,574    835,722,268
  Total vector masked (bp)              1,655,293      2,438,575         82,284
  Total contaminant masked (bp)        14,918,135     16,311,664      3,365,230
  Average contig length (bp)                  811          8,203         91,466

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center;
Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE;
Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence
Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer
Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic
Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas
Southwestern Medical Center; University of Washington. †The 4,405,700,825 bases contributed by all centers were
shredded into faux reads resulting in 2.96X coverage of the genome.

For the whole-genome assembly, the PFP
data was first disassembled or "shredded" into a
synthetic shotgun data set of 550-bp reads that
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were
sufficient to cover the genome 2.96X because
of redundancy in the BAC data set, without
incorporating the biases inherent in the PFP
assembly process. The combined data set of
43.32 million reads (8X), and all associated
mate-pair information, were then subjected to
our whole-genome assembly algorithm to
produce a reconstruction of the genome. Neither
the location of a BAC in the genome nor its
assembly of bactigs was used in this process.
Bactigs were shredded into reads because we
found strong evidence that 2.13% of them were
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 22% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true ab
initio whole-genome assembly in which we
took the expedient of deriving additional
sequence coverage, but not mate pairs, assembled
bactigs, or genome locality, from some
externally generated data.
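
The shredding of bactigs into faux reads can be illustrated with a simple tiling scheme. This is a minimal sketch under stated assumptions, not the procedure actually used: the function name is hypothetical, reads are started every read_len / coverage bases so that interior positions are covered twice, and a final read is added flush with the end of the bactig. How the original process treated bactig ends and bactigs shorter than one read is not stated in the text.

```python
def shred(sequence, read_len=550, coverage=2):
    """Cut a bactig into overlapping fixed-length faux reads.

    Reads start every read_len // coverage bases, so interior positions are
    covered `coverage` times. Assumptions: a last read is placed flush with
    the end of the bactig, and a bactig no longer than read_len is returned
    as a single read.
    """
    if len(sequence) <= read_len:
        return [sequence]
    step = read_len // coverage
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the bactig is covered
    return [sequence[s:s + read_len] for s in starts]


if __name__ == "__main__":
    bactig = "ACGT" * 500  # 2,000-bp toy bactig
    reads = shred(bactig)
    print(len(reads), {len(r) for r in reads})
```

Because the reads carry no mate pairs and no record of their source bactig, an assembler given this output sees only additional sequence coverage, which is the point made in the paragraph above.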

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments
or "components" that could be determined with
confidence, and then shotgun assembly was
applied to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was
reduced and the effect of interchromosomal
duplications was ameliorated. This also resulted in a
reconstruction of the genome that was relatively
independent of the whole-genome assembly
results so that the two assemblies could be
compared for consistency. The quality of the
partitioning into components was crucial so that
different genome regions were not mixed
together. We constructed components from (i) the
longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate
and complete the scaffold for a given sequence
stretch, the more accurately one can tile these
scaffolds into contiguous components on the
basis of sequence overlap and mate-pair
information. We further visually inspected and
curated the scaffold tiling of the components to
further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of
the sequence in each component was obtained
by applying our whole-genome assembly
algorithm to the partitioned, relevant Celera data and
the shredded, faux reads of the partitioned,
relevant bactig data.

2.3 Whole-genome assembly 

The algorithms used for whole-genome as- 
sembly (WGA) of the human genome were 
enhancements to those used to produce the 
sequence of the Drosophila genome reported 
in detail in (28). 

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements, includ-
ing Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.

The Overlapper compares every read
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously vector-
trimmed, the Overlapper can insist on com-
plete overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
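
The overlap criterion (an end-to-end overlap of at least 40 bp with no more than 6% differences) can be illustrated with a deliberately naive check. This sketch is not the Overlapper: it is a brute-force comparison of one read pair, it counts substitutions only (insertions and deletions are ignored, which is a simplifying assumption), and the function name is hypothetical.

```python
def best_overlap(read_a, read_b, min_len=40, max_diff=0.06):
    """Find the longest suffix of read_a that matches a prefix of read_b.

    Returns (overlap_length, mismatches) for the longest overlap of at least
    min_len bases whose mismatch fraction does not exceed max_diff, or None
    if no such overlap exists. Assumption: only substitutions are counted.
    """
    longest = min(len(read_a), len(read_b))
    for length in range(longest, min_len - 1, -1):
        suffix = read_a[-length:]
        prefix = read_b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length, mismatches
    return None


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(200))
    # Two reads sampled 50 bp apart share a 50-bp end-to-end overlap.
    print(best_overlap(genome[0:100], genome[50:150]))
```

Comparing every read against every other read this way would be quadratic in the number of reads; the CPU-hour figures quoted above suggest why a production overlapper needs a far more efficient search.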

Every overlap computed above is statisti-
cally a 1-in-10^17 event and thus not a coinci-
dental event. What makes assembly combi-
natorially difficult is that while many over-
laps are actually sampled from overlapping
regions of the genome, and thus imply that 
the sequence reads should be assembled to- 
gether, even more overlaps are actually from 
two distinct copies of a low-copy repeated 
element not screened above, thus constituting 
an error if put together. We call the former 
"true overlaps" and the latter "repeat-induced 
overlaps." The assembler must avoid choos- 
ing repeat-induced overlaps, especially early 
in the process. 

We achieve this objective in the Unitig- 
ger. We first find all assemblies of reads that 
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely as- 
sembled contigs). Formally, these unitigs are 
the uncontested interval subgraphs of the
graph of all overlaps (42). Unfortunately, al-
though empirically many of these assemblies
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6X simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp 
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
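
The text does not give the form of the discriminator, so the following is only one plausible sketch of a log-odds score based on coverage depth. It assumes read start positions follow a Poisson process: a unitig of unique DNA with length L is expected to hold about L x (total reads / genome size) reads, while a collapsed two-copy repeat is expected to hold twice that. The function names and the Poisson model are assumptions, not the authors' stated method.

```python
import math


def poisson_log_pmf(k, mean):
    """Natural log of the Poisson probability of observing k events."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)


def unique_log_odds(unitig_len, reads_in_unitig, total_reads, genome_size):
    """Log-odds that a unitig is unique DNA rather than a two-copy repeat.

    Under the assumed Poisson model the ratio of the two likelihoods
    simplifies to: expected_reads - reads_in_unitig * ln(2).
    Positive scores favor unique DNA; negative scores favor a repeat.
    """
    expected_reads = unitig_len * total_reads / genome_size
    return expected_reads - reads_in_unitig * math.log(2)


if __name__ == "__main__":
    # 43.32 million reads over a 2.9-Gbp genome, as in the combined data set.
    for k in (150, 300):
        score = unique_log_odds(10_000, k, 43_320_000, 2.9e9)
        print(f"10-kbp unitig with {k} reads: log-odds {score:.1f}")
```

A stringent positive threshold on such a score would select unitigs that are almost certainly unique, and a looser threshold would admit additional candidates to be confirmed by scaffolding, mirroring the two-threshold scheme described above.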

The result of running the Unitigger was 
thus a set of correctly assembled subcontigs 
covering an estimated 73.6% of the human 
genome. The Scaffolder then proceeded to 
use mate-pair information to link these to- 
gether into scaffolds. When there are two or 
more mate pairs that imply that a given pair 
of U-unitigs are at a certain distance and 
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one 
can with high confidence link together all 
U-unitigs that are linked by at least two 2- or 
10-kbp mate pairs producing intermediate- 
sized scaffolds that are then recursively 
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process 
yielded scaffolds that are on the order of 
megabase pairs in size with gaps between 
their contigs that generally correspond to re- 
petitive elements and occasionally to small 
sequencing gaps. These scaffolds reconstruct 
the majority of the unique sequence within a
genome. 
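
The first linking step, joining U-unitigs that are connected by at least two mate pairs, can be sketched as a grouping problem. This is a simplified illustration, not the Scaffolder: it ignores orientation, distance, and the later recursive linking with 50-kbp mates and BAC ends, and the function and unitig names are hypothetical.

```python
from collections import Counter


def scaffold_groups(mate_links, min_links=2):
    """Group unitigs joined by at least min_links mate pairs.

    mate_links is a list of (unitig_a, unitig_b) tuples, one per mate pair
    whose two reads fall in different unitigs. Returns a sorted list of
    sorted unitig-name lists, one per group. Assumption: agreement in
    distance and orientation has already been checked for each link.
    """
    counts = Counter(tuple(sorted(pair)) for pair in mate_links)
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), n_links in counts.items():
        root_a, root_b = find(a), find(b)
        if n_links >= min_links and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    links = [("u1", "u2"), ("u2", "u1"), ("u2", "u3"), ("u3", "u4"), ("u3", "u4")]
    print(scaffold_groups(links))
```

Requiring two concordant links before joining is what keeps a single chimeric or mistracked mate pair from merging unrelated regions.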

For the Drosophila assembly, we engaged
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we contin-
ued to use the first "Rocks" substage where
all unitigs with a good, but not definitive
discriminator score are placed in a scaffold
gap. This was done with the condition that
two or more mate pairs with one of their
reads already in the scaffold unambiguously
place the unitig in the given gap. We estimate
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10 based on a probabilistic analysis.

We revised the ensuing "Stones" substage 
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the 
gap, eliminating any reads that conflict with 
the assembly. This operation proved much 
more reliable than the one it replaced for the 
Drosophila assembly; in the assembly of a
simulated shotgun data set of human chromo-
some 22, all stones were placed correctly.

The final method of resolving gaps is to
fill them with assembled BAC data that cover
the gap. We call this external gap "walking."
We did not include the very aggressive "Peb-
bles" substage described in our Drosophila
work, which made enough mistakes so as to
produce repeat reconstructions for long inter-
spersed elements whose quality was only
99.62% correct. We decided that for the hu-
man genome it was philosophically better not
to introduce a step that was certain to produce
less than 99.99% accuracy. The cost was a
somewhat larger number of gaps of some-
what larger size.
At the final stage of the assembly process,
and also at several intermediate points, a
consensus sequence of every contig is pro-
duced. Our algorithm is driven by the princi-
ple of maximum parsimony, with quality-
value-weighted measures for evaluating each
base. The net effect is a Bayesian estimate of
the correct base to report at each position.
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
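
A quality-weighted base call with the Celera-first rule can be sketched as below. This is a much-simplified stand-in for the consensus algorithm described above: it simply sums quality values per base within one alignment column and picks the heaviest base, with alphabetical tie-breaking. The function names, the (base, quality) input format, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


def weighted_call(column):
    """Pick the base with the largest summed quality value in one column.

    column is a non-empty list of (base, quality) pairs, one per read that
    covers the position. Ties are broken alphabetically (an assumption).
    """
    weights = defaultdict(int)
    for base, quality in column:
        weights[base] += quality
    return max(sorted(weights), key=weights.get)


def consensus_call(celera_column, bac_column):
    """Use Celera reads when any cover the position, otherwise BAC data."""
    column = celera_column if celera_column else bac_column
    return weighted_call(column)


if __name__ == "__main__":
    print(consensus_call([("A", 30), ("G", 10), ("G", 15)], [("T", 40)]))
    print(consensus_call([], [("T", 40)]))
```

Summing quality values is only a rough proxy for the Bayesian estimate mentioned in the text, but it shows why one high-quality read can outvote two low-quality ones.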

A key element of achieving a WGA of the
human genome was to parallelize the Overlapper
and the central consensus sequence-constructing
subroutines. In addition, memory was
a real issue: a straightforward application of
the software we had built for Drosophila would
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same
computation with a maximum of instantaneous
usage of 28 gigabytes of RAM. Moreover, the
incremental nature of the first three stages
allowed us to continually update the state of this
part of the computation as data were delivered
and then perform a 7-day run to complete
Scaffolding and Repeat Resolution whenever
desired. For our assembly operations, the total
compute infrastructure consists of 10
four-processor SMPs with 4 gigabytes of memory per
cluster (Compaq's ES40, Regatta) and a
16-processor NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The
total compute for a run of the assembler was
roughly 20,000 CPU hours.

The assembly of Celera's data, together 
with the shredded bactig data, produced a set of 
scaffolds totaling 2.848 Gbp in span and con- 
sisting of 2.586 Gbp of sequence. The chaff or 
set of reads not incorporated in the assembly 
numbered 11.27 million (26%), which is con-
sistent with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the dis[...]

[...]sembled individually. We expected that this
would help in resolution of large interchro-
mosomal duplications and improve [...]
The compartmentalized assembly process involved
clustering Celera reads and bactigs into large
multiple megabase regions of the genome
[...] shredded, faux reads ob-
tained from the bactig data.

The first phase of the CSA strategy was to
separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC [...]

Scaffold size class                              All          [...]       >100 kbp          [...]          [...]

Compartmentalized shotgun assembly
No. of bp in scaffolds                 2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs                   2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                              53,591          2,845          1,935          1,060            721
No. of contigs                               170,033        112,207        107,199         93,138         82,009
No. of gaps                                  116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                            72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                    54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                      15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)            2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                        1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                               100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds                 2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
  (including intrascaffold gaps)
No. of bp in contigs                   2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                             118,968          2,507          1,637            818            554
No. of contigs                               221,036         99,189         95,494         84,641         76,285
No. of gaps                                  102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                            62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                    23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                      11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)            2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                        1,224,073          [...]          [...]          [...]      1,224,073
% of total contigs                               100          [...]          [...]          [...]             77

[...] properly place a Celera read, so all reads were
first masked against a library of common 
repetitive elements, and only matches of at 
least 40 bp to unmasked portions of the read 
constituted a hit. Of Celera's 27.27 million 
reads, 20.76 million matched a bactig and 
another 0.62 million reads, which did not 
have any matches, were nonetheless identi- 
fied as belonging in the region of the bactig's 
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were 
completely screened out and so could not be 
matched, but the other 2.97 million reads had 
unmasked sequence totaling 1.189 Gbp that 
were not found in the GenBank data set. 
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera 
sequence is not in the GenBank data set. 

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there 
are excessive pileups indicative of un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions
are removed. Then all sets of mate pairs that 
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
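
The greedy bundle-selection step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Celera implementation: the function name `greedy_scaffold`, the `(bactig_a, bactig_b, weight)` tuple layout, and the `is_consistent` callback (a stand-in for the check against the majority of links between contigs) are all hypothetical. The sketch tracks only which bactigs end up in the same scaffold; order and orientation are left out.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Bundle = Tuple[Hashable, Hashable, int]  # (bactig_a, bactig_b, number of mate pairs)


def greedy_scaffold(
    bactigs: Iterable[Hashable],
    bundles: Iterable[Bundle],
    is_consistent: Callable[[Set[Hashable], Set[Hashable], Bundle], bool] = lambda a, b, bundle: True,
) -> List[Set[Hashable]]:
    """Group bactigs into scaffolds by taking mate-pair bundles heaviest first.

    `is_consistent` is a hypothetical hook standing in for the check that a
    bundle agrees with the majority of links between the two scaffolds.
    """
    parent: Dict[Hashable, Hashable] = {b: b for b in bactigs}
    members: Dict[Hashable, Set[Hashable]] = {b: {b} for b in parent}

    def find(x):
        # Walk to the representative of x's scaffold, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bundle in sorted(bundles, key=lambda t: t[2], reverse=True):
        a, b, _weight = bundle
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # both bactigs already sit in the same scaffold
        if not is_consistent(members[ra], members[rb], bundle):
            continue  # reject bundles that contradict existing links
        parent[rb] = ra
        members[ra] |= members.pop(rb)

    return list(members.values())
```

A union-find structure is used here only because it makes "tie together two formative scaffolds" cheap to express; the source does not say how scaffolds were represented internally.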

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of 



assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler
was also applied to the Phase 3 BACs for
SNP identification, confirmation of assembly,
and localization of the Celera reads. The
Phase 0 data suggest that a combined
whole-genome shotgun data set and 1X
light-shotgun of BACs will not yield good
assembly of BAC regions; at least 3X
light-shotgun of each BAC is needed.

The 5.89 million Celera fragments not 
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting 
at least 95% of the relevant sequence, and a 
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome
components was to determine the order and
overlap tiling of these BAC and Celera-unique
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pairs information, 
and BAC-end pairs (18) and sequence tagged
site (STS) markers (44) to provide long- 
range guidance and chromosome separation. 
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



Chimeric or contaminating sequence (from 
another part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process 
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to produce
an ab initio assembly of the region.
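
The shredding of bactigs into a synthetic 2X shotgun data set, mentioned above, can be illustrated with a short Python sketch. The source does not describe the shredding scheme, so the 550-bp read length, the evenly spaced tiling, and the name `shred` are assumptions made purely for illustration.

```python
from typing import List, Tuple


def shred(sequence: str, read_len: int = 550, coverage: float = 2.0) -> List[Tuple[int, str]]:
    """Cut `sequence` into overlapping faux reads giving roughly `coverage`-fold depth.

    Returns (start_offset, read) pairs. The read length and the evenly spaced
    tiling are illustrative assumptions, not parameters taken from the text.
    """
    if read_len <= 0 or coverage <= 0:
        raise ValueError("read_len and coverage must be positive")
    if len(sequence) <= read_len:
        return [(0, sequence)]
    step = max(1, int(read_len / coverage))  # 2X coverage -> step of half a read
    starts = list(range(0, len(sequence) - read_len + 1, step))
    if starts[-1] != len(sequence) - read_len:
        starts.append(len(sequence) - read_len)  # make sure the far end is covered
    return [(s, sequence[s:s + read_len]) for s in starts]
```

Because the faux reads carry no memory of the bactig they came from, an assembler given them is free to place each piece independently, which is the property the paragraph above relies on.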

WGA assembly of the components resulted
in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of sequence.
The chaff, or set of reads not incorporated
into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100 
kbp. The average scaffold size was 1.4 Mbp, 
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As 
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than 
49% of all gaps were <500 bp long, more
than 62% of all gaps were < 1 kbp, and all 
gaps are < 100 kbp long. Similarly, more than 
73% of the sequence is in contigs > 30 kbp, 
more than 49% is in contigs > 100 kbp, and 
the largest contig was 1.99 Mbp long. Table 3 
provides summary statistics for the structure 
of this assembly with a direct comparison to 
the WGA assembly. 

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads, or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and
2.474 Gbp, respectively. The sequence of each reference
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were 
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not covered by a matching segment in the 
other assembly. Some 82.5 Mbp of the WGA 
(3.95%) was not covered by the CSA, where- 
as 204.5 Mbp (8.26%) of the CSA was not 
covered by the WGA. This estimate did not 
require any consistency of the assemblies or 
any uniqueness of the matching segments. 
Thus, another analysis was conducted in 
which matches of less than 1 kbp between a 
pair of scaffolds were excluded unless they 
were confirmed by other matches having a 
consistent order and orientation. This gives
some measure of consistent coverage: 1.982
Gbp (95.00%) of the WGA is covered by the
CSA, and 2.169 Gbp (87.69%) of the CSA is
covered by the WGA by this more stringent
measure.

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candi- 
dates was identified automatically, and then 
each candidate was inspected by hand. From
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being fur- 
ther evaluated to determine which assembly 
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in 
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely >1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp 
(0.11%) in the WGA assembly were incon- 
sistent with the CSA assembly. 
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The CSA assembly was a few percentage 
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems 
whereas the WGA is performing a shotgun
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the in- 
formation loss between the two is remarkably 
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly.



2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data be-
tween the scaffolds. We next mapped the scaf- 
fold groups onto the chromosome using physi- 
cal mapping data. This step depends on having 
reliable high-resolution map information such 
that each scaffold will overlap multiple mark- 
ers. There are two genome-wide types of map 
information available: high-density STS maps 
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the ge-
nome-wide STS maps, GeneMap99 (GM99) 
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have bet- 
ter local order because they were built by com- 
parison of overlapping BAC clones. On the 
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a 
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler. 
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In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaf- 
folds. Only 1% of the STS markers on the 10 
largest scaffolds (those >9 Mbp) were
mapped on a different chromosome on
GM99. Two percent of the STS markers
disagreed in position by more than five
framework bins. However, for the fingerprint
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed 
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10 
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low discrepancy
rate to GM99, and thus we concluded
that the fingerprint map global order in these
cases was not reliable. Smaller scaffolds had
a higher discordance rate with GM99 (4.21%
of STSs were discordant by more than five
framework bins), but a lower discordance rate
with the fingerprint maps (11% of BACs
disagreed with fingerprint maps by more than
five BACs). This observation agrees with the 
clone coverage analysis (46) that Celera scaf- 
fold construction was better supported by 
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with 
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could 
be ordered relative to the anchored scaffolds 
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
nome was ordered unambiguously. 
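
The orientation rule stated above (several mapped markers with consistent order fix a scaffold's orientation, whereas a single marker cannot) can be sketched as follows. The function name `orient_scaffold`, the `(scaffold_position, map_position)` input layout, and the strict all-or-nothing consistency test are assumptions for illustration; the source does not say how much disagreement among markers was tolerated.

```python
from typing import List, Optional, Tuple


def orient_scaffold(markers: List[Tuple[int, float]]) -> Optional[str]:
    """Infer orientation from (scaffold_position, map_position) marker pairs.

    Returns "+" if map positions rise along the scaffold, "-" if they fall,
    and None when there are fewer than two markers or the order is mixed.
    """
    if len(markers) < 2:
        return None  # a single marker cannot fix orientation
    ordered = sorted(markers)  # sort by position along the scaffold
    steps = [b[1] - a[1] for a, b in zip(ordered, ordered[1:])]
    if all(s > 0 for s in steps):
        return "+"
    if all(s < 0 for s in steps):
        return "-"
    return None
```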

Next, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bound- 
ed" between them. For example, small scaf- 
folds having STS hits from the same Gene- 
Map bin or hitting the same BAC cannot be 
ordered relative to each other, but can be 
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above ap- 
proaches, ~98% of the genome was an-
chored, ordered, or bounded. 

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the total number of mapped
scaffolds, we arrived at an estimate
of interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each 
chromosome and to assign an offset in the 
chromosome. 
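
The gap estimate and offset assignment described above amount to simple bookkeeping, sketched here in Python. The names `estimate_gap` and `assign_offsets` are hypothetical, and starting the first scaffold at offset 0 is an assumption; only the 1483-bp figure comes from the text.

```python
from typing import Dict, List, Tuple

INTERSCAFFOLD_GAP_BP = 1483  # estimate quoted in the text


def estimate_gap(unmapped_lengths: List[int], n_mapped: int) -> float:
    """Spread the total unmapped sequence evenly over the mapped scaffolds."""
    if n_mapped <= 0:
        raise ValueError("need at least one mapped scaffold")
    return sum(unmapped_lengths) / n_mapped


def assign_offsets(
    scaffolds: List[Tuple[str, int]], gap_bp: int = INTERSCAFFOLD_GAP_BP
) -> Dict[str, int]:
    """Lay out ordered (name, length) scaffolds along one chromosome.

    Each scaffold starts where the previous one ended plus a fixed gap.
    """
    offsets: Dict[str, int] = {}
    position = 0
    for name, length in scaffolds:
        offsets[name] = position
        position += length + gap_bp
    return offsets
```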

During the scaffold-mapping effort, we en- 
countered many problems that resulted in addi- 
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult. 
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2.7 Assembly and validation analysis 

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the eu- 
chromatin sequence has been completed. 
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an
independent set of random sequences (STS
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 57). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality
and published (48, 49). Although this se- 
quence served as input to the assembler, the 
finished sequence was shredded into a shot- 
gun data set so that the assembler had the 
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must
be able to resolve repetitive elements at the
scale of components (generally multimegabase
in size), and so this comparison reveals
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 
A more global way of assessing complete- 



ness is to measure the content of an independent 
set of sequence data in the assembly. We com- 
pared 48,938 STS markers from Genemap99 
(51) to the scaffolds. Because these markers 
were not used in the assembly processes, they 
provided a truly independent measure of com- 
pleteness. ePCR (53) and BLAST (54) were 
used to locate STSs on the assembled genome.
We found 44,524 (91%) of the STSs in the 
mapped genome. An additional 2648 markers 
(5.4%) were found by searching the unas- 
sembled data or "chaff." We identified 1283 
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371 
markers (88%) were located in the mapped 
CSA scaffolds, with 2055 markers (5.6%) 
found in the remainder. This gave a 94% cov- 
erage of the genome through another genome- 
wide survey. 
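
The marker-based completeness percentages above follow directly from the quoted counts. The short Python sketch below recomputes them; the helper name `marker_coverage` and its dictionary keys are hypothetical, and only the counts are taken from the text.

```python
def marker_coverage(total: int, in_assembly: int, in_chaff: int, excluded: int = 0) -> dict:
    """Percent of markers found in the assembly and in the unassembled chaff.

    `excluded` markers (for example, those suspected not to be of human
    origin) are removed from the denominator before percentages are taken.
    """
    denominator = total - excluded
    if denominator <= 0:
        raise ValueError("no markers left after exclusion")
    assembled = 100.0 * in_assembly / denominator
    chaff = 100.0 * in_chaff / denominator
    return {"assembled": assembled, "chaff": chaff, "combined": assembled + chaff}


# Counts quoted in the text.
gm99 = marker_coverage(48_938, 44_524, 2_648)
gm99_human_only = marker_coverage(48_938, 44_524, 2_648, excluded=1_283)
tng = marker_coverage(36_678, 32_371, 2_055)
```

Evaluated this way, the GM99 and TNG figures agree with the percentages reported above to within about 0.1 percentage point.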

Correctness. Correctness is defined as the 
structural and sequence accuracy of the as- 
sembly. Because the source sequences for the 
Celera data and the GenBank data are from
different individuals, we could not directly 
compare the consensus sequence of the assembly
against other finished sequence for
determining sequencing accuracy at the nucleotide
level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaffolds
were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had order
conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Unmapped
scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

The structural consistency of the assembly 
can be measured by mate-pair analysis. In a 
correct assembly, every mated pair of se- 
quencing reads should be located on the con- 
sensus sequence with the correct separation 
and orientation between the pairs. A pair is 
termed "valid" when the reads are in the
correct orientation and the distance between 
them is within the mean ± 3 standard devi- 
ations of the distribution of insert sizes of the 
library from which the pair was sampled. A 
pair is termed "misoriented" when the reads 
are not correctly oriented, and is termed "mis- 
separated" when the distance between the 
reads is not in the correct range but the reads 
are correctly oriented. The mean ± the stan- 
dard deviation of each library used by the 
assembler was determined as described 
above. To validate these, we examined all
reads mapped to the finished sequence of 
chromosome 21 (48) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chi- 
merism (two different segments of the ge-
nome cloned into the same plasmid), and how 
tight the distribution of insert sizes was for 
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those that were correct (Table 5). The stan- 
dard deviations for all Celera libraries were 
quite small, less than 15% of the insert 
length, with the exception of a few 50-kbp 
libraries. The 2- and 10-kbp libraries con- 
tained less than 2% invalid mate pairs, where- 
as the 50-kbp libraries were somewhat higher 
(~ 10%). Thus, although the mate-pair infor- 
mation was not perfect, its accuracy was such 
that measuring valid, misoriented, and mis- 
separated pairs with respect to a given assem- 
bly was deemed to be a reliable instrument 
for validation purposes, especially when sev- 
eral mate pairs confirm or deny an ordering. 
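
The three mate-pair categories defined above (valid, misoriented, misseparated) can be expressed as a small classifier. This is a sketch under stated assumptions: the names `Library` and `classify_mate_pair` are hypothetical, positions are taken to be the outer end of each read on the consensus, and "correct orientation" is taken to mean that the left read lies on the forward strand and the right read on the reverse strand. Only the mean ± 3 standard deviations rule comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    mean_insert: float  # mean insert size in bp
    sd_insert: float    # standard deviation of insert size in bp


def classify_mate_pair(
    pos_a: int, strand_a: str, pos_b: int, strand_b: str, lib: Library, n_sd: float = 3.0
) -> str:
    """Return "valid", "misoriented", or "misseparated" for one mate pair."""
    # Order the two reads by position on the consensus.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos_a, strand_a), (pos_b, strand_b)]
    )
    # Assumed convention: mates point toward each other.
    if (left_strand, right_strand) != ("+", "-"):
        return "misoriented"
    separation = right_pos - left_pos
    if abs(separation - lib.mean_insert) <= n_sd * lib.sd_insert:
        return "valid"
    return "misseparated"
```

Orientation is tested first, so a pair that is both wrongly oriented and wrongly spaced is reported as misoriented, matching the definition that misseparated pairs are correctly oriented.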

The clone coverage of the genome was 
39 X, meaning that any given base pair was, 
on average, contained in 39 clones or, equiv- 
alently, spanned by 39 mate-paired reads. 
Areas of low clone coverage or areas with a 
high proportion of invalid mate pairs would 
indicate potential assembly problems. We 
computed the coverage of each base in the 
assembly by valid mate pairs (Table 6). In 
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in 
regions of less than 3X clone coverage. Thus, 
more than 99% of the assembly, including 
order and orientation, is strongly supported
by this measure alone. 
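
Per-base clone coverage by valid mate pairs, as described above, can be computed with a difference array. The sketch below is illustrative only: the names `clone_coverage` and `fraction_below` are hypothetical, and each valid pair is assumed to be given as a half-open interval from the outer end of one mate to the outer end of the other.

```python
from itertools import accumulate
from typing import Iterable, List, Tuple


def clone_coverage(length: int, valid_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-base count of valid mate pairs spanning each position.

    `valid_pairs` holds half-open [start, end) intervals in consensus
    coordinates; intervals are clipped to the sequence length.
    """
    delta = [0] * (length + 1)
    for start, end in valid_pairs:
        start, end = max(0, start), min(length, end)
        if start < end:
            delta[start] += 1  # coverage rises where the clone begins
            delta[end] -= 1    # and falls just past where it ends
    return list(accumulate(delta[:-1]))


def fraction_below(coverage: List[int], threshold: int = 3) -> float:
    """Fraction of bases whose clone coverage is under `threshold`."""
    return sum(c < threshold for c in coverage) / len(coverage) if coverage else 0.0
```

With a threshold of 3, `fraction_below` gives the kind of "less than 3X clone coverage" fraction quoted in the paragraph above.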

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA 
assembly (as of 1 October 2000), we also 
performed a study of the PFP assembly as of 



5 September 2000 (30, 55b). In this latter 
case, Celera mate pairs had to be mapped to 
the PFP assembly. To avoid mapping errors 
due to high-fidelity repeats, the only pairs
mapped were those for which both reads 
matched at only one location with less than 
6% differences. A threshold was set such that 
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint, 
where the construction of the two assemblies 
differed. The graphic comparison of the CSA 
chromosome 21 assembly with the published 
sequence (Fig. 6A) serves as a validation of 
this methodology. Blue tick marks in the 
panels indicate breakpoints. There were a 
similar (small) number of breakpoints on 
both chromosome sequences. The exception 
was 12 sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
in 212 single-contig scaffolds) that were 
mapped to the wrong positions because they 
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair 
differences and breakpoints between the two 
assemblies. There was a higher percentage of 
misoriented and misseparated mate pairs in 
the large-insert libraries (50 kbp and BAC
ends) than in the small-insert libraries in both 
assemblies (Table 6). The large-insert librar- 
ies are more likely to identify discrepancies 
simply because they span a larger segment of
the genome. The graphic comparison be-
tween the two assemblies for chromosome 8 
(Fig. 6, B and C) shows that there are many 



Table 5. Mate-pair validation. Celera fragment
sequences were mapped to the published sequence
of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and
placement (number of mate pairs tested). If the
two mates had incorrect relative orientation
or placement, they were considered invalid
(number of invalid mate pairs).
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be done with reasonable accuracy when a 
full-length cDNA has been sequenced or a
highly homologous protein sequence is
known. De novo gene prediction, although
less accurate, is the only way to find genes
that are not represented by homologous proteins
or ESTs. The following section describes
the methods we have developed to
address these problems for the prediction of
protein-coding genes.

We have developed a rule-based expert system,
called Otto, to identify and characterize
genes in the human genome (60). Otto attempts
to simulate the process that a human
annotator uses to identify a gene and refine its
Initially, gene boundaries are predicted on
the basis of examination of sets of overlapping
protein and EST matches generated by a
computational pipeline (62). This pipeline
searches the scaffold sequences against protein,
EST, and genome-sequence databases to
define regions of sequence similarity and
runs de novo gene-prediction programs.
To identify likely gene boundaries, regions
of the genome were partitioned by Otto
on the basis of sequence matches identified
by BLAST. Each of the database sequences
matched in the region under analysis was
compared by an algorithm that takes into



sciencemag.org 



SCIENCE VOL 291 16 FEBRUARY 2001 



gene boundaries. During this process, multiple
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins," each of which was be-
lieved to contain a single gene. One weakness of 
this initial implementation of the algorithm was 
in predicting gene boundaries in regions of tan- 
demly duplicated genes. Gene clusters frequent- 
ly resulted in homologous neighboring genes 
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being joined together, resulting in an annotation 
that artificially concatenated these gene models. 
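
Collapsing multiple overlapping hits into the union of the regions they cover, as in the EST example above, is an interval-merging problem. The Python sketch below shows only that step; the name `merge_hits` and the half-open `(start, end)` coordinate convention are assumptions, and the sketch does not attempt the evidence-type-aware binning that Otto performs.

```python
from typing import Iterable, List, Tuple


def merge_hits(hits: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Collapse overlapping [start, end) hits on a scaffold into their union."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

A purely positional merge like this one also illustrates the weakness noted above: hits from tandemly duplicated neighboring genes that overlap or abut would be fused into a single region.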

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge-
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curat- 
ed human gene set RefSeq from the Nation- 
al Center for Biotechnology Information 
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to 



the region of the genome under analysis was 
promoted to the status of an Otto annotation.
Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not 
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 

Regions that have a substantial amount of 
sequence similarity, but do not match known 
genes, were analyzed by that part of the Otto 
system that uses the sequence similarity in- 
formation to predict a transcript. Here, Otto
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Fig. 6. Comparison of the CSA and the PFP assembly 
(A) All of chromosome 21, (B) all of chromosome 8, 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations 
(red, misoriented; yellow, incorrect distance between 
the mates) for each assembly grouped by library size. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web 
fig. 3 on Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1.

evaluates evidence generated by the computational
pipeline, corresponding to conservation
between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known 
proteins to predict potential genes in the hu- 
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bases flanking these regions). The other bases 
in the region, those not covered by any homol- 
ogy evidence, were replaced by N's. This se- 
quence segment, with high confidence regions 
represented by the consensus genomic se- 
quence and the remainder represented by N's, 
was then evaluated by Genscan to see if a 
consistent gene model could be generated. This 
procedure simplified the gene-prediction task 
by first establishing the boundary for the gene 
(not a strength of most gene-finding algo- 
rithms), and by eliminating regions with no 
supporting evidence. If Genscan returned a 
plausible gene model, it was further evaluated 
before being promoted to an "Otto" annotation. 
The final Genscan predictions were often quite 
different from the prediction that Genscan re- 
turned on the same region of native genomic 
sequence. A weakness of using Genscan to 
refine the gene model is the loss of valid, small 
exons from the final annotation. 
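
The masking step described above can be sketched as follows. This is an illustration only: the interval convention (half-open, 0-based), the function name, and the `flank` parameter are assumptions, and the size of the flanking region actually used is not stated here.

```python
def mask_unsupported(seq, supported, flank=0):
    """Keep bases inside supported regions (plus flank); replace the rest with N."""
    keep = [False] * len(seq)
    for start, end in supported:
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if k else "N" for base, k in zip(seq, keep))


# Hypothetical 10-base segment with homology evidence over positions 2..4.
print(mask_unsupported("ACGTACGTAC", [(2, 5)]))
print(mask_unsupported("ACGTACGTAC", [(2, 5)], flank=1))
```

The masked segment, not the native sequence, is what would then be passed to the gene finder.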

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to 
allow for 5' and 3' untranslated regions 
(UTRs). To be retained, a prediction for a 
multi-exon gene must have evidence such that 
the total number of "hits," as defined above, 
divided by the number of exons in the predic- 
tion must be >0.66 or must correspond to a 
RefSeq sequence. A single-exon gene must be 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and 




those that passed were promoted to Otto 
predictions. Homology-based Otto predictions
do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding 
programs [GRAIL, Genscan, and FgenesH 
(63)] were run as part of the computational 
analysis, the results of these programs were not 
directly used in making the Otto predictions. 
Otto predicted 11,226 additional genes by 
means of sequence similarity. 
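
The retention rule for multi-exon predictions can be sketched as below. Several points are assumptions made for illustration: exons and evidence are half-open (start, end) intervals listed in forward-strand order, one "hit" is counted per supported exon, and "greater latitude" at the external edge of a terminal exon is modeled by not checking that edge at all.

```python
TOLERANCE = 10  # bases allowed between an exon edge and the evidence edge


def exon_supported(exon, evidence, check_start=True, check_end=True):
    """True if one evidence interval overlaps the exon and matches the checked edges."""
    start, end = exon
    for ev_start, ev_end in evidence:
        if ev_end <= start or ev_start >= end:
            continue  # no overlap with this exon
        start_ok = (not check_start) or abs(ev_start - start) <= TOLERANCE
        end_ok = (not check_end) or abs(ev_end - end) <= TOLERANCE
        if start_ok and end_ok:
            return True
    return False


def keep_multi_exon(exons, evidence, is_refseq=False):
    """Retain the prediction if hits / exons > 0.66 or it corresponds to a RefSeq."""
    last = len(exons) - 1
    hits = sum(
        exon_supported(exon, evidence, check_start=(i != 0), check_end=(i != last))
        for i, exon in enumerate(exons)
    )
    return is_refseq or hits / len(exons) > 0.66


# Hypothetical three-exon prediction; the last exon has no evidence.
exons = [(100, 200), (300, 400), (500, 600)]
evidence = [(50, 205), (295, 402), (700, 800)]
print(keep_multi_exon(exons, evidence))
```

The single-exon rule (at least three supporting hits covering the complete open reading frame, plus a start and a stop codon) would need a separate check and is not covered by this sketch.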



Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).

Method                Sensitivity   Specificity
Otto (RefSeq only)*   0.939         0.973
Otto (homology)†      0.604         0.884
Genscan               0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an
evidence-based Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.



3.2 Otto validation 

To validate the Otto homology-based process 
and the method that Otto uses to define the 
structures of known genes, we compared transcripts
predicted by Otto with their corresponding
(and presumably correct) transcript from a
set of 4512 RefSeq transcripts for which there
was a unique SIM4 alignment (Table 7). In
order to evaluate the relative performance of 
Otto and Genscan, we made three comparisons. 
The first involved a determination of the accu- 
racy of gene models predicted by Otto with 
only homology data other than the correspond- 
ing RefSeq sequence (Otto homology in Table 
7). We measured the sensitivity (correctly pre- 
dicted bases divided by the total length of the 
cDNA) and specificity (correctly predicted 
bases divided by the sum of the correctly and 
incorrectly predicted bases). Second, we exam- 
ined the sensitivity and specificity of the Otto 
predictions that were made solely with the Ref- 
Seq sequence, which is the process that Otto 
uses to annotate known genes (Otto-RefSeq). 
And third, we determined the accuracy of the 
Genscan predictions corresponding to these 
RefSeq sequences. As expected, the alignment 
method (Otto-RefSeq) was the most accurate, 
and Otto-homology performed better than Genscan
by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not con- 
tained in the original RefSeq transcripts. The 
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 
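
The two measures follow directly from the definitions in Table 7. The sketch below uses hypothetical base counts chosen only so that the ratios reproduce the Otto (RefSeq only) row; the function name is illustrative.

```python
def sensitivity_specificity(n_aligned, refseq_len, prediction_len):
    """Sensitivity = N / RefSeq length; specificity = N / prediction length."""
    return n_aligned / refseq_len, n_aligned / prediction_len


# Hypothetical transcript: 939 uniquely aligned bases, a 1000-base RefSeq
# transcript, and a 965-base prediction.
sens, spec = sensitivity_specificity(939, 1000, 965)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
print(f"RefSeq bases missed: {1 - sens:.1%}")
print(f"predicted bases outside RefSeq: {1 - spec:.1%}")
```

Read this way, the 6.1% and 2.7% figures in the text are simply one minus the sensitivity and one minus the specificity for Otto-RefSeq.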

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons 
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be com- 
bined into a single transcript. We also examined 
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different gene-pre- 
diction strategy in regions where the ho- 
mology evidence was less strong. Here the 
results of de novo gene predictions were 
used. For these genes, we insisted that a 
predicted transcript have at least two of the 
following types of evidence to be included 
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome 
fragment matches. This final class of pre- 
dicted genes is a subset of the predictions 
made by the three gene-finding programs 
that were used in the computational pipeline.
For these, there was not sufficient
sequence similarity information for Otto to 
attempt to predict a gene structure. The 
three de novo gene-finding programs resulted
in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping
with one another). Of these,
57,935 did not overlap known genes or
predictions made by Otto. Only 21,350 of 
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene 
complement. As seen in Table 8, if the re- 
quirement for other supporting evidence is 
made more stringent, this number drops rap- 
idly so that demanding two types of evidence 
reduces the total gene number to 26,383 and 
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode 
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 
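
The gene-number arithmetic in this section can be checked in a few lines. All counts are taken from the text and Table 8; only the variable names are introduced here.

```python
otto_known = 6538        # Otto annotations from matches to known genes
otto_homology = 11226    # Otto annotations from homology evidence
otto_total = otto_known + otto_homology

# De novo predictions not overlapping Otto, by minimum number of
# supporting lines of evidence (Table 8).
de_novo_by_min_lines = {1: 21350, 2: 8619, 3: 4947}

totals = {k: otto_total + n for k, n in de_novo_by_min_lines.items()}
for k, total in totals.items():
    print(f">= {k} line(s) of evidence: {total:,} genes")
```

The three totals correspond to the upper estimate, the 26,383-gene set, and the figure of about 23,000 quoted above.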

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the following
evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs, or similarity to a known protein)
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the 
number of genes and presents the degree of 
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confidence based on the supporting evidence. 
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homol- 
ogy evidence, and 8619 from the subset of 
transcripts from de novo gene-prediction pro- 
grams that have two types of supporting ev- 
idence. The 26,383 genes are illustrated along 
chromosome diagrams in Fig. 1. These are a 
very preliminary set of annotations and are 
subject to all the limitations of an automated 
process. Considerable refinement is still nec- 
essary to improve the accuracy of these tran- 
script predictions. All the predictions and 
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that sup- 
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port the Otto and other predicted transcripts. 
For example, one can see that a typical Otto 
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of
the noncoding attributes of the assembled 
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 
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4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the 
most visible, element of the structure of
the genome is the banding pattern produced 
by Giemsa stain. Chromosomal banding 
studies have revealed that about 17% to 
20% of the human chromosome comple- 
ment consists of C-bands, or constitutive 
heterochromatin (64). Much of this hetero- 
chromatin is highly polymorphic and con- 
sists of different families of alpha satellite 
DNAs with various higher order repeat 
structures (65). Many chromosomes have 
complex inter- and intrachromosomal du- 
plications present in pericentromeric re- 
gions (66). About 5% of the sequence reads 
were identified as alpha satellite sequences; 
these were not included in the assembly.

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



                                    Types of evidence                        No. of lines of evidence*
                          Total     Mouse     Rodent    Protein   Human      ≥1        ≥2        ≥3        ≥4
Otto
  Number of transcripts   17,969    17,065    14,881    15,477    16,374     17,968    17,501    15,877    12,451
  Number of exons         141,218   111,174   89,569    108,431   118,869    140,710   127,955   99,574    59,804
De novo
  Number of transcripts   58,032    14,463    5,094     8,043     9,220      21,350    8,619     4,947     1,904
  Number of exons         319,935   48,594    19,344    26,264    40,104     79,148    31,130    17,508    6,520
No. of exons per transcript
  Otto                    7.84      5.77      6.01      6.99      7.24       7.81      7.19      6.00      4.28
  De novo                 5.53      3.17      3.80      3.27      4.36       3.7       3.56      3.42      3.16

*Four kinds of evidence (conservation in 3× …
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Examination of pericentromeric regions is 
ongoing. 

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-,
R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide
composition and gene density, although we
have been unable to determine precise band
boundaries at the molecular level. T-bands are
the most G+C- and gene-rich, and G-bands are
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy)
isochores fall into three G+C-rich classes representing
24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in
the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content 
across the assembly, we found that regions of 

G+C content >48% (H3 isochores) averaged
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was
1078.6 kbp. The correlation between G+C
content and gene density was also examined in 
50-kbp windows along the assembled sequence 
(Table 9 and Figs. 10 and 11). We found that 
the density of genes was greater in regions of 
high G+C than in regions of low G+C content,
as expected. However, the correlation between
G+C content and gene density was not as
skewed as previously predicted (69). A higher
proportion of genes were located in the G+C-poor
regions than had been expected.
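
The windowed G+C analysis can be sketched as follows, using the class boundaries quoted above (<43%, 43 to 48%, >48%). The function names are illustrative, windows are taken as non-overlapping, and ambiguous bases such as N are excluded from the denominator, which is an assumption rather than a documented detail of the analysis.

```python
def gc_fraction(seq):
    """G+C fraction over unambiguous bases, or None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def isochore_class(gc):
    """Map a G+C fraction to the isochore classes used in the text."""
    if gc < 0.43:
        return "L"
    if gc <= 0.48:
        return "H1+H2"
    return "H3"


def classify_windows(seq, window=50_000):
    """Classify contiguous, non-overlapping windows along a sequence."""
    classes = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc is not None:
            classes.append((start, isochore_class(gc)))
    return classes


# Toy sequence with a tiny window so the example is easy to follow.
toy = "AT" * 5 + "GC" * 5
print(classify_windows(toy, window=10))
```

Runs of adjacent windows in the same class could then be merged to obtain the average spans reported above.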

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3 -containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density, X, 4,
18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3 
bands, did not have a particularly low gene 
density in our analysis. In addition, chromo- 
some 8, which we found to have a low gene 
density, does not appear to be unusual in its 
H3 banding. 

How valid is Ohno's postulate (71) that 
mammalian genomes consist of oases of genes 
in otherwise essentially empty deserts? It appears
that the human genome does indeed contain
deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22
have only about 12% of their collective 171
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not necessarily
imply that they are devoid of biological



4.2 Linkage map 

Linkage maps provide the basis for genetic 
analysis and are widely used in the study of the 
inheritance of traits and in the positional cloning
of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between
homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed
as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been
previously documented (73). From this mapping
result, there is a difference of 4.99 between
lowest rates and highest rates and the largest
difference between males and females
(4.99 to 0.47 on chromosome 16). This indicates
that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates between
males and females. The human genome
has recombination hotspots, where recombination
rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination



Fig. 9. Comparison of the number of exons per transcript between
the 17,968 Otto transcripts and 21,350 de novo transcript predictions
with at least one line of evidence that do not overlap with an
Otto prediction. Both sets have the highest number of transcripts
in the two-exon category, but the de novo gene predictions are
skewed much more toward smaller transcripts. In the Otto set,
19.7% of the transcripts have one

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examined. Unfortunately, too few meiotic 
crossovers have occurred in Centre d'Etude 
du Polymorphism Humain (CEPH) and other 
reference families to provide a resolution any 
finer than about 3 Mbp. The next challenge 
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor for the rate for variation in 
recombination rates between any pair of 
markers would be extremely useful in design- 
ing markers to narrow a region of linkage 
such as in positional cloning projects.

4.3 Correlation between CpG islands
and genes

CpG islands are stretches of unmethylated
DNA with a higher frequency of CpG
dinucleotides when compared with the entire
genome (74). CpG islands are believed to
preferentially occur at the transcriptional start
of genes, and it has been observed that most
housekeeping genes have CpG islands at the
5' end of the transcript (75, 76). In addition,
experimental evidence indicates that CpG island
methylation is correlated with gene inactivation
(77) and has been shown to be
important during gene imprinting (78) and
tissue-specific gene expression (79).

Experimental methods have been used
that resulted in an estimate of 30,000 to
45,000 CpG islands in the human genome
(74, 80) and an estimate of 499 CpG islands
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer
(75) used a computational method to identify
CpG islands and defined them as regions
of DNA of >200 bp that have a G+C
content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide
≥0.6.
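
The computational definition above can be written as a short test on a single candidate segment. This is a sketch, not the cited authors' code: the function names are illustrative, and the expected CG count is taken as (number of C × number of G) / length, the usual form of the observed/expected ratio.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG ratio) for one segment."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return 0.0, 0.0
    observed = seq.count("CG")
    expected = c * g / n
    return (c + g) / n, observed / expected


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the >200 bp, >50% G+C, and obs/exp >= 0.6 criteria."""
    if len(seq) <= min_len:
        return False
    gc, ratio = cpg_stats(seq)
    return gc > min_gc and ratio >= min_ratio


# Toy segments: CG-dense, AT-only, and G+C-rich but nearly CG-free.
print(is_cpg_island("CG" * 150))
print(is_cpg_island("AT" * 150))
print(is_cpg_island("C" * 150 + "G" * 150))
```

The third segment shows why the ratio matters: a region can be G+C-rich and still fail because CG dinucleotides are depleted.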

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG island
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly 
available annotation of chromosome 22 as 
well as using the entire human genome in 
our assembly and the computationally an- 
notated genes. A variation of the CpG is- 
land computation was compared with 
Larsen et al. (76). The main differences are 
that we use a sliding window of 200 bp,
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
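
The three differences listed above (a 200-bp sliding window, merging only overlapping windows, and rescoring after the merge) can be sketched as below. The step size, the helper names, and the reuse of the >50% G+C criterion for each window are assumptions made for illustration; the description above does not specify them.

```python
def passes(seq, min_gc=0.5, min_ratio=0.6):
    """G+C content and observed/expected CG ratio test for one segment."""
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return False
    ratio = seq.count("CG") * n / (c * g)
    return (c + g) / n > min_gc and ratio >= min_ratio


def find_islands(seq, window=200, step=1, min_ratio=0.6):
    """Slide a window, merge overlapping hits, then rescore merged segments."""
    seq = seq.upper()
    # 1. Keep every window that passes on its own.
    hits = [(i, i + window)
            for i in range(0, len(seq) - window + 1, step)
            if passes(seq[i:i + window], min_ratio=min_ratio)]
    # 2. Merge consecutive windows only if they overlap.
    merged = []
    for start, end in hits:
        if merged and start < merged[-1][1]:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # 3. Recompute on each merged segment and reject it if it now fails.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_ratio=min_ratio)]


# Toy sequence: a CG-dense core flanked by AT-only sequence.
toy = "AT" * 200 + "CG" * 150 + "AT" * 200
print(find_islands(toy))
```

Passing `min_ratio=0.8` instead of the default would correspond to the stricter threshold discussed in the next paragraph.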

To compute various CpG statistics, we 
used two different thresholds of CG dinucleotide
likelihood ratio. Besides using the original
threshold of 0.6 (method 1), we used a
higher threshold of CG dinucleotide likelihood
ratio of 0.8 (method 2), which results in
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only 2.6% of the 
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a 



CpG island. This is comparable to ratios reported
by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island are
smaller than the corresponding expected distances,
confirming an association between CpG island
and the first exon.

We also looked at the distribution of CpG
island nucleotides among various sequence
classes such as intergenic regions, introns,
exons, and first exons. We computed the
likelihood score for each sequence class as
the ratio of the observed fraction of CpG
island nucleotides in that sequence class
to the expected fraction of CpG island
nucleotides in that sequence class. The results
of applying method 1 on CSA were
scores of 0.89 for intergenic region, 1.2 for
intron, 5.86 for exon, and 13.2 for first
exon. The same trend was also found for
chromosome 22 and after the application of
a higher threshold (method 2) on both data
sets. In sum, genome-wide analysis has
extended earlier analysis and suggests a
strong correlation between CpG islands and
first coding exons.
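
The likelihood score is an observed-to-expected ratio per sequence class. In the sketch below, the expected fraction for a class is taken to be that class's share of all bases, i.e., what would be seen if CpG island nucleotides were spread uniformly; this choice, the assumption of non-overlapping classes, and the base counts are all hypothetical and are not stated in the text.

```python
def likelihood_scores(island_bases, class_bases):
    """Observed fraction of island bases in a class / expected fraction."""
    total_island = sum(island_bases.values())
    total_bases = sum(class_bases.values())
    scores = {}
    for name in class_bases:
        observed = island_bases[name] / total_island
        expected = class_bases[name] / total_bases
        scores[name] = observed / expected
    return scores


# Hypothetical counts: exons are 10% of the sequence but hold half
# of the CpG island nucleotides.
island_bases = {"intergenic": 50, "exon": 50}
class_bases = {"intergenic": 900, "exon": 100}
print(likelihood_scores(island_bases, class_bases))
```

A score above 1 indicates enrichment of CpG island nucleotides in that class, and a score below 1 indicates depletion.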
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4.4 Genome-wide repetitive elements 

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very similar
to values reported previously (83). Repetitive
sequence may be underrepresented in
the Celera assembly as a result of incomplete
repeat resolution, as discussed above. About
8% of the scaffold length is in gaps, and we
expect that much of this is repetitive sequence.
Chromosome 19 has the highest repeat
density (57%), as well as the highest
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density.

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels.
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes.

Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp win- 
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dows. The percent of G+C nucleotides was calculated in 
windows. The number of ESTs and Alu elements is shown per 
window. 
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5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA 
transcripts into the genome results in func- 
tional genes, called intronless paralogs, or 
inactivated genes (pseudogenes). A paralog 
refers to a gene that appears in more than 
one copy in a given organism as a result of 



a duplication event. The existence of both 
intron-containing and intronless forms of 
genes encoding functionally similar or 
identical proteins has been previously described
(84, 85). Cataloging these evolutionary
events on the genomic landscape is
of value in understanding the functional 
consequences of such gene-duplication 



events in cellular biology. Identification of 
conserved intronless paralogs in the mouse 
or other mammalian genomes should pro- 
vide the basis for capturing the evolution- 
ary chronology of these transposition 
events and provide insights into gene loss 
and accretion in the mammalian radiation. 
A set of proteins corresponding to all 901 
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Otto-predicted, single-exon genes were sub- 
jected to BLAST analysis against the proteins 
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% sequence identity over 90% of the 
length, we identified 298 instances of single- 
to multi-exon correspondence. Of these 298 
sequences, 97 were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified 
and were verified by manual inspection. 
We believe that these 97 cases may represent
intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (84, 86). We do not find a bias 
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chro- 
mosomes. Interesting examples include the 
retrotransposition of a five-exon-containing
ribosomal protein L21 gene on chromosome
13 onto chromosomes 1, 3, 4, 7, 10, and 14,
respectively. The size of the source genes can
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed intronless
paralogs contains a clear overrepresentation
of genes involved in translational
processes (40% ribosomal proteins and 10%
translation elongation factors) and nuclear
regulation (HMG nonhistone proteins, 4%),
as well as metabolic and regulatory enzymes.
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could account
for differences in tissue-specific gene
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 
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5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is 
very similar to a normal gene but that has 
been altered slightly so that it is not expressed.
We developed a method for the preliminary
analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces
that account for gene inactivation. The general
structural characteristics of these processed
pseudogenes include the complete
lack of intervening sequences found in the 
functional counterparts, a poly(A) tract at the 
3' end, and direct repeats flanking the pseu- 
dogene sequence. Processed pseudogenes oc- 
cur as a result of retrotransposition, whereas 
unprocessed pseudogenes arise from segmen- 
tal genome duplication. 

We searched the complete set of Otto- 
predicted transcripts against the genomic se- 
quence by means of BLAST. Genomic re- 
gions corresponding to all Otto-predicted 
transcripts were excluded from this analysis.
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
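
The identity and coverage thresholds described above can be sketched as a simple filter. This is a minimal illustration in Python, not the authors' pipeline: the `Hit` record and its field names are hypothetical, and it assumes each BLAST match has already been summarized as a percent identity plus the number of transcript bases it covers.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one transcript-versus-genome match."""
    transcript_id: str
    transcript_length: int    # length of the query transcript, in bp
    aligned_length: int       # transcript bases covered by the match
    percent_identity: float   # identity over the aligned region, 0-100


def is_candidate_pseudogene(hit, min_identity=70.0, min_coverage=0.70):
    """Apply the two thresholds from the text: identity greater than 70%
    over at least 70% of the transcript length."""
    coverage = hit.aligned_length / hit.transcript_length
    return hit.percent_identity > min_identity and coverage >= min_coverage


def candidate_pseudogenes(hits):
    """Keep only the matches that pass both thresholds."""
    return [hit for hit in hits if is_candidate_pseudogene(hit)]
```

Matches that fall in regions already assigned to a predicted transcript would need to be removed before this step, as the text describes; that exclusion is not shown here.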

We looked for correlations between 
structural elements and the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were com- 
pared between the genes with processed 
pseudogenes (1177 source genes) versus
the remainder of the predicted gene set. 
Transcripts that give rise to processed pseu- 
dogenes have shorter average transcript 
length (1027 bp versus 1594 bp for the Otto 
set) as compared with genes for which no 
pseudogene was detected. The overall GC 
content did not show any significant differ- 
ence, contrary to a recent report (88). There 
is a clear trend in gene families that are 
present as processed pseudogenes. These 
include ribosomal proteins (67%), lamin
receptors (10%), translation elongation factor
alpha (5%), and HMG-non-histone proteins
(2%). The increased occurrence of
retrotransposition (both intronless paralogs
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 
Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.
The complete clusters that result from the
Lek clustering provide one basis for comparing
the role of whole-genome or chromosomal
duplication in protein family expansion as
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organism's
contribution to each cluster can then be
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families 
presumably as a result of natural selection 
operating on individual protein families within
an organism. As can be seen in Fig. 12, the
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 

clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/
worm predominance, as we observed (Fig.
12). Furthermore, there are nearly as many
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major
force driving the expansion of at least some
elements of the human protein set. However,
in our analysis, the difference between an
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly
conserved blocks of duplication. We then
describe our comprehensive method for identifying
all interchromosomal block duplications.
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were determined
to be in the same family and the
same complete Lek cluster (essentially
paralogous genes) (89). Initially, each chromosome
was represented as a string of genes
ordered by the start codons for predicted 
genes along the chromosome. We considered 
the two strands as a single string, because 
local inversions are relatively common events 
relative to large-scale duplications. Each 
gene was indexed according to the protein 
family and Lek complete cluster (89). All
pairs of indexed gene strings were then
aligned in both the forward and reverse directions
with the Smith-Waterman algorithm
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 
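
The gene-string alignment described above can be sketched as a local alignment over cluster identifiers rather than residues. The following Python sketch is illustrative only and is not the authors' implementation. It assumes that each chromosome is a list of hypothetical Lek cluster IDs in gene order, and that a gap of length k scores gap_open + (k - 1) * gap_extend, a convention the text does not specify.

```python
def local_align_score(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best Smith-Waterman local alignment score between two gene strings,
    using affine gap penalties (Gotoh recurrences)."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    best_any = [[0] * cols for _ in range(rows)]          # best score ending at (i, j)
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]    # alignment ends with a gap in a
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]    # alignment ends with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            gap_in_a[i][j] = max(best_any[i][j - 1] + gap_open,
                                 gap_in_a[i][j - 1] + gap_extend)
            gap_in_b[i][j] = max(best_any[i - 1][j] + gap_open,
                                 gap_in_b[i - 1][j] + gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best_any[i][j] = max(0, best_any[i - 1][j - 1] + pair,
                                 gap_in_a[i][j], gap_in_b[i][j])
            best = max(best, best_any[i][j])
    return best


def best_block_score(a, b):
    """Align in both the forward and reverse directions, as in the text."""
    return max(local_align_score(a, b), local_align_score(a, b[::-1]))
```

For example, two chromosomes sharing the ordered clusters 1, 2, 3, 4 with one unrelated gene inserted on one of them score 4 matches minus one gap opening under these assumptions.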

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concatenated
end-to-end in order as they occur
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the 
filtering methods, a shuffled protein set was
first created by taking the 26,588 proteins,
randomizing their order, and then partitioning
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on
the shuffled data being used to estimate the 
false-positive rate. The algorithm after filter- 
ing yielded 10,310 gene pairs in 1077 dupli- 
cated blocks containing 3522 distinct genes; 
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli- 
cations is ancient segmental duplications. In 
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
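
The shuffled control described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: chromosomes are represented as lists of hypothetical protein IDs, and the false-positive estimate is taken as the ratio of gene pairs found in the shuffled data to gene pairs found in the real data, which reproduces the 3.6% figure quoted in the text.

```python
import random


def shuffled_genome(chromosomes, seed=0):
    """Randomize protein order across the genome, then repartition into
    chromosomes holding the same number of proteins as the originals."""
    pool = [protein for chromosome in chromosomes for protein in chromosome]
    random.Random(seed).shuffle(pool)
    shuffled, start = [], 0
    for chromosome in chromosomes:
        shuffled.append(pool[start:start + len(chromosome)])
        start += len(chromosome)
    return shuffled


def false_positive_estimate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real
```

With the counts reported above, `false_positive_estimate(370, 10310)` is about 0.036. The block-size counts are also internally consistent: 159 + 137 + 781 = 1077.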

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstructions
at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a 
region containing 97 proteins on chromosome
2 and 332 proteins on chromosome 14.
The likelihood of observing this many duplicated
proteins by chance, even over a span of
this length, is 2.3 × 10^-68 (93). This duplicated
set spans 20 Mbp on chromosome 2 and
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset). This duplication
contains 64 detected ordered intrachromo- 
somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 
Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted.
By this measure, the duplication segment
spans nearly half of each chromosome's net 
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromo- 
some 20, for a density of involved proteins of 
20 to 30%. This is consistent with an ancient 
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density observed
here. As an independent verification
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of 
the pairs of aligning proteins in this duplica- 
tion, including some of those annotated (Fig. 
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 
logs; their relative scarcity within the genome 
validates the uniqueness and robust nature of 
their alignments. 

Two additional qualitative features were ob- 
served among many of the large-scale duplica- 
tions. First, several proteins with disease asso- 
ciations, with OMIM (Online Mendelian Inher- 
itance in Man) assignments, are members of 
duplicated segments (see web table 2 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This selective
accretion of noncoding DNA (or conversely,
loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was
observed in many compared regions. Hypotheses
to explain which mechanisms foster these
processes must be tested.

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the 
blocks detected by this genome-wide analysis. 
The regions of human chromosomes involved 
in the large-scale duplications expanded upon 
above (chromosomes 2 to 14, 2 to 12, and 18 to 
20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to 
their human synteny partners than the human 
duplication regions are to each other. Further, 
the corresponding mouse chromosomal regions 
each bear a significant proportion of genes or- 
thologous to the human genes on which the 
human duplication assignments were made. On 
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolution,
appear to be products of the same large-scale
duplications observed in humans. Although
further detailed analysis must be carried
out once a more complete genome is assembled
for mouse, the underlying large duplications
appear to predate the two species' divergence.
This dates the duplications, at the latest, before 
divergence of the primate and rodent lineages. 
This date can be further refined upon examina- 
tion of the synteny between human chromo- 
somes and those of chicken, pufferfish (Fugu 
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually reveal
the stagewise history of our genome, and
with it a history of the emergence of
the key functions that distinguish us from other
living things.

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used
to identify single-nucleotide polymorphisms
(SNPs) by comparison of the Celera sequence
to other SNP resources. The SNP rate between
two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small
proportion of all SNPs (<1%) potentially
impact protein function based on the functional
analysis of SNPs that affect the predicted
coding regions. This results in an estimate
that only thousands, not millions, of
genetic variations may contribute to the structural
diversity of human proteins.

Having a complete genome sequence enables 
researchers to achieve a dramatic acceleration 
in the rate of gene discovery, but only through 
analysis of sequence variation in DNA can we
discover the genetic basis for variation in health 
among human beings. Whole-genome shotgun 
sequencing is a particularly effective method 
for detecting sequence variation in tandem with 
whole-genome assembly. In addition, we com- 
pared the distribution and attributes of SNPs 
ascertained by three other methods: (i) alignment
of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads
of genomic sequence (referred to as "Kwok";
1,120,195 SNPs) (97), and (iii) reduced representation
shotgun sequencing (referred to as
"TSC"; 632,640 SNPs) (98). These data were 
consistent in showing an overall nucleotide diversity
of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an
overwhelming preponderance of noncoding
variation that produces no change in expressed 
proteins. 



6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details neces- 
sitated a more ad hoc approach (quality scores 
could not readily be obtained for the PFP as- 
sembly). First, all sequence differences between 
the two consensus sequences were identified; 
these were then filtered to reduce the contribu- 
tion of sequencing errors and misassembly. As 
a measure of the effectiveness of the filtering 
step, we monitored the ratio of transition and 
transversion substitutions, because a 2:1 ratio
has been well documented as typical in mammalian
evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of removing
variants where the quality score in the
Celera consensus was less than 30 and where
the density of variants was greater than 5 in 400
bp. These filters resulted in shifting the
transition-to-transversion ratio from 1.57:1 to
1.89:1. When applied to 2.3 Gbp of alignments
between the Celera and PFP consensus sequences,
these filters resulted in identification
of 2,104,820 putative SNPs from a total of
2,778,474 substitution differences. Overlaps
between this set of SNPs and those found by
other methods are described below.
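
The two filters and the transition-to-transversion check can be sketched as follows. This Python sketch is illustrative and makes several assumptions the text does not state: variants are hypothetical tuples of (position, reference base, alternate base, quality) sorted by position, the 400-bp window is centered on each variant, and density is counted over all candidate differences before the quality filter is applied.

```python
from bisect import bisect_left, bisect_right

# A<->G and C<->T are transitions; every other substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Drop variants with quality below 30 or with more than 5 variants
    in the surrounding 400-bp window. Input must be sorted by position."""
    positions = [pos for pos, _, _, _ in variants]
    kept = []
    for pos, ref, alt, quality in variants:
        if quality < min_quality:
            continue
        lo = bisect_left(positions, pos - window // 2)
        hi = bisect_right(positions, pos + window // 2 - 1)
        if hi - lo > max_in_window:   # count includes the variant itself
            continue
        kept.append((pos, ref, alt, quality))
    return kept


def ts_tv_ratio(variants):
    """Ratio of transitions to transversions in a set of substitutions."""
    transitions = sum(1 for _, ref, alt, _ in variants if (ref, alt) in TRANSITIONS)
    transversions = len(variants) - transitions
    return transitions / transversions if transversions else float("inf")
```

Comparing `ts_tv_ratio` before and after filtering gives the kind of diagnostic the text describes: a ratio moving toward 2:1 suggests that the removed differences were enriched for errors.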

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from
dbSNP (www.ncbi.nlm.nih.gov/SNP) and 
13,150 from HGMD (Human Gene Muta- 
tion Database, from the University of 
Wales, UK), were mapped on the Celera con- 
sensus sequence by a sequence similarity 
search with the program PowerBlast (103). The
two largest data sets in dbSNP are the Kwok 
and TSC sets, with 47% and 25% of the dbSNP 
records. Low-quality alignments with partial 
coverage of the dbSNP sequence and align- 
ments that had less than 98% sequence identity 
between the Celera sequence and the dbSNP 
flanking sequence were eliminated. dbSNP se- 
quences mapping to multiple locations on the 
Celera genome were discarded. A total of
[...] dbSNP variants were mapped to
1,223,038 unique locations on the Celera sequence,
implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique locations.
The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of 
these methods was also found by another method.
The very high overlap (36.2%) between the
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into
the PFP assembly. The lower overlap
(16.4%) between the Kwok and TSC sets is due
to their being the two smallest sets. In addition,
24.5% of the Celera-PFP SNPs overlap with
SNPs derived from the Celera genome sequences
(46). SNP validation in population
samples is an expensive and laborious process,
so confirmation on multiple data sets may provide
an efficient initial validation "in silico" (by
computational analysis).

Table 15. Summary of overlap of SNPs from genome-wide
SNP databases. Table entries are SNP counts and
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032.
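
The overlap fraction used for this comparison, the count of shared SNPs divided by the size of the smaller data set, can be written directly. The sketch below is a minimal Python illustration; it assumes each data set has been reduced to a set of unique genomic locations, which is a simplification of the mapping procedure described above.

```python
def overlap_fraction(locations_a, locations_b):
    """Return (shared count, shared count / size of the smaller set)."""
    shared = len(locations_a & locations_b)
    return shared, shared / min(len(locations_a), len(locations_b))
```

Normalizing by the smaller set keeps the fraction from being driven down simply because one database is much larger than the other.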

One means of assessing whether the 
three sets of SNPs provide the same picture 
of human variation is to tally the frequencies
of the six possible base changes in
each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly
derived from small-scale analysis on candidate
genes (101), and our analysis with
all three data sets validates the previous
observations at the whole-genome scale.
There is remarkable homogeneity between
the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46)
in this substitution pattern. Compared with
the rest of the data sets, Celera-PFP deviates
slightly from the 2:1 transition-to-transversion
ratio observed in the other
data sets. This result is not unexpected
because some fraction of the computation- 
ally identified SNPs in the Celera-PFP 
comparison may in fact be sequence errors. 
A 2:1 transition:transversion ratio for the 
bona fide SNPs would be obtained if one 
assumed that 15% of the sequence differ- 
ences in the Celera-PFP set were a result of 
(presumably random) sequence errors. 

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucleotide
diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation sequencing,
we need to know the sequence
quality and the depth of coverage at each
site. These data are not readily available, so
we could not estimate nucleotide diversity
from the TSC effort. Estimation of nucleotide
diversity from high-quality sequence
overlaps should be possible, but again
more information is needed on the details
of all the alignments.
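
As a concrete reading of the definition above, nucleotide diversity can be computed from a set of aligned sequences as the average, over surveyed sites, of the probability that two distinct sampled chromosomes differ. The Python sketch below is a textbook-style illustration and not the estimator used in the paper; in particular it assumes error-free, gap-free alignments of equal length and ignores the coverage and quality corrections discussed in the text.

```python
from collections import Counter


def nucleotide_diversity(sequences):
    """Per-site heterozygosity for aligned, equal-length sequences,
    one sequence per sampled chromosome (at least two)."""
    n = len(sequences)
    surveyed_sites = len(sequences[0])
    total = 0.0
    for column in zip(*sequences):
        counts = Counter(column)
        # Probability that two distinct sampled chromosomes carry the same base.
        same = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
        total += 1.0 - same
    return total / surveyed_sites
```

For a single pair of chromosomes this reduces to the number of differing sites divided by the number of sites surveyed, so a diversity of 8 × 10^-4 corresponds to roughly one difference per 1250 bp.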

Estimation of nucleotide diversity from a [...]
column of the multialignment, the probability
that two or more distinct alleles are present
and the probability of detecting a SNP if in
fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The
greater the depth of coverage and the higher
the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even
after correcting for variation in coverage, the
nucleotide diversity appeared to vary across
autosomes. The significance of this heterogeneity
was tested by analysis of variance, with
estimates of π for 100-kbp windows to estimate
variability within chromosomes (for the
Celera-PFP comparison, F = 29.73, P < [...]).
Average diversity for the autosomes estimated
from the Celera-PFP comparison
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than autosomes,
because for every four copies of
autosomes in the population, there are only
three X chromosomes, and this smaller effective
population size means that random [...]
Having ascertained nucleotide variation 
genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102, 106, 107). Genome-wide,
nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10
[...] human genes was [...].



6.4 Variation in nucleotide diversity 
across the human genome 

Table 16. Summary of nucleotide changes in different SNP data sets.

Table 15 entries (SNP counts shared by each pair of data sets; fraction of the smaller data set in parentheses):

              TSC                Kwok
Celera-PFP    188,694 (0.322)    158,532 (0.362)
TSC                              72,024 (0.164)

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show
the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line
represents a pair of homologous genes belonging to a block; all blocks contain at least three
genes on each of the chromosomes where they appear. Each panel shows all the duplications
between a single chromosome and other chromosomes with shared blocks. The chromosome at the
center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed
from top to bottom within each panel ordered by chromosome number. The inset (bottom, center
right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display
the gene names of 12 of the 64 gene pairs shown.

Such an apparently high degree of variability
among chromosomes in SNP density
raises the question of whether there is heterogeneity
at a finer scale within chromosomes,
and whether this heterogeneity is
greater than expected by chance. If SNPs 
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral 
coalescent (109). Applying well-tested algorithms
for simulating the neutral coalescent
with recombination (110), and using an ef- 
fective population size of 10,000 and a per- 
base recombination rate equal to the mutation 
rate (111), we generated a distribution of num- 
bers of SNPs by this model as well (112). The 
observed distribution of SNPs has a much larg- 
er variance than either the Poisson model or the 
coalescent model, and the difference is highly 
significant. This implies that there is significant
variability across the genome in SNP density, 
an observation that begs an explanation. 
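
One simple way to quantify the overdispersion described above is the index of dispersion, the ratio of the variance to the mean of per-window SNP counts, which is expected to be close to 1 under a Poisson model. The Python sketch below is an illustration of that general statistic, not the test used in the paper, and the window counts passed to it are assumed to come from equal-sized windows.

```python
from statistics import mean, variance


def index_of_dispersion(window_counts):
    """Sample variance divided by mean of SNP counts per window.
    Values well above 1 indicate more clustering than a Poisson model predicts."""
    return variance(window_counts) / mean(window_counts)
```

A coalescent simulation with recombination, as the text notes, predicts more variance than the Poisson model, so a value above 1 alone does not establish heterogeneity beyond that expected from shared genealogical history.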

Several attributes of the DNA sequence 
may affect the local density of SNPs, in- 
cluding the rate at which DNA polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cy- 
tosines in CpG dinucleotides tend to under- 
go deamination to form thymine, account- 
ing for a nearly 10-fold increase in the 
mutation rate of CpGs over other dinucleotides.
We tallied the GC content and nucleotide
diversities in 100-kbp windows
across the entire genome and found that the 
correlation between them was positive (r = 
0.21) and highly significant (P < 0.0001), 
but G+C content accounted for only a 
small part of the variation. 
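
The statement that G+C content explains only a small part of the variation follows from the size of the correlation: for a linear relationship, the squared correlation coefficient is the fraction of variance explained. The Python sketch below computes a Pearson correlation from paired per-window values; the input lists are hypothetical, and the function is a standard formula rather than the authors' code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def variance_explained(r):
    """Fraction of variance accounted for by a linear fit with correlation r."""
    return r ** 2
```

With the reported r = 0.21, `variance_explained(0.21)` is about 0.044, so G+C content accounts for roughly 4% of the variance in nucleotide diversity across windows.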

6.5 SNPs by genomic class 

To test homogeneity of SNP densities 
across functional classes, we partitioned 
sites into intergenic (defined as >5 kbp 
from any predicted transcription unit), 5'- 
UTR, exonic (missense and silent), in- 
tronic, and 3'-UTR for 10,239 known 
genes, derived from the NCBI RefSeq da- 
tabase and all human genes predicted from 
the Celera Otto annotation. In coding re- 
gions, SNPs were categorized as either si- 
lent, for those that do not change amino 
acid sequence, or missense, for those that 
change the protein product. The ratio of 
missense to silent coding SNPs in Celera- 
PFP, TSC, and Kwok sets (1.12, 0.91, and 
0.78, respectively) shows a markedly re- 
duced frequency of missense variants com- 
pared with the neutral expectation, consis- 
tent with the elimination by natural selec- 
tion of a fraction of the deleterious amino 
acid changes (112). These ratios are comparable
to the missense-to-silent ratios of
0.88 and 1.17 found by Cargill et al. (101)
and by Halushka et al. (102). Similar re- 
sults were observed in SNPs derived from 
Celera shotgun sequences (46). 
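
The silent versus missense split described above can be illustrated with the standard genetic code. The Python sketch below is a minimal example, not the annotation pipeline used in the paper: it assumes the SNP lies in a known reading frame on the coding strand, and, following the two-way definition in the text, it labels any change to the encoded product (including changes to or from a stop codon) as missense.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(codon, offset, alt_base):
    """Return 'silent' if the substitution leaves the amino acid unchanged,
    otherwise 'missense'. offset is 0, 1, or 2 within the codon."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    return "silent" if CODON_TABLE[codon] == CODON_TABLE[mutated] else "missense"


def missense_to_silent_ratio(labels):
    """Ratio of missense to silent calls in a list of classifications."""
    return labels.count("missense") / labels.count("silent")
```

Because most random single-base changes in coding sequence alter the amino acid, observed ratios near 1 imply that a substantial share of missense changes has been removed, which is the argument the text makes.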

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 Ref- 
Seq genes, missense SNPs were only about 
0.12, 0.14, and 0.17% of the total SNP 
counts in Celera-PFP, TSC, and Kwok 
SNPs, respectively. Nonconservative pro- 
tein changes constitute an even smaller frac- 
tion of missense SNPs (47, 41, and 40% in 
Celera-PFP, Kwok, and TSC). Intergenic re- 
gions have been virtually unstudied (113), and 
we note that 75% of the SNPs we identified 
were intergenic (Table 17). The SNP rate was 
highest in introns and lowest in exons. The SNP 
rate was lower in intergenic regions than in 
introns, providing one of the first discriminators 
between these two classes of DNA. These SNP 
rates were confirmed in the Celera SNPs, which 
also exhibited a lower rate in exons than in 
introns, and in extragenic regions than in in- 
trons (46). Many of these intergenic SNPs will 
provide valuable information in the form of 
markers for linkage and association studies, and 
some fraction is likely to have a regulatory 
function as well. 

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes 
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. 
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely 
accounted for by a coalescent model of regional history. 
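
The partition of sites by distance from transcription units can be sketched as follows. This is only an illustration of the ">5 kbp from any predicted transcription unit" rule for intergenic sites: the function name, the interval representation, and the "flanking" label for sites within 5 kbp of a unit are assumptions that do not appear in the source.

```python
import math


def region_class(pos, transcription_units, flank=5000):
    """Classify a genomic position relative to transcription units.

    transcription_units -- iterable of (start, end) coordinates, inclusive
    flank               -- distance beyond which a site counts as intergenic
    """
    nearest = math.inf
    for start, end in transcription_units:
        if start <= pos <= end:
            return "transcribed"
        nearest = min(nearest, abs(pos - start), abs(pos - end))
    # Hypothetical label for sites outside a unit but within the flank.
    return "intergenic" if nearest > flank else "flanking"


def snp_rate(snp_count, bases_surveyed):
    """SNPs per base pair for one class of sites."""
    return snp_count / bases_surveyed
```

With a single unit spanning 10,000 to 20,000, position 15,000 is classed as transcribed, 22,000 as flanking, and 30,000 as intergenic.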

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging 
prominent differences and similarities 
when the human genome is compared with 
other fully sequenced eukaryotic genomes. 
Over 40% of the predicted protein set in 
humans cannot be ascribed a molecular 
function by methods that assign proteins to 
known families. A protein domain-based 
analysis provides a detailed catalog of the 
prominent differences in the human ge- 
nome when compared with the fly and 
worm genomes. Prominent among these are 
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 



A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are prelimi- 
nary and are subject to several limitations. 
Both the gene predictions and functional 
assignments have been made by using com- 
putational tools, although the statistical 
models in Panther, Pfam, and SMART have 
been built, annotated, and reviewed by ex- 
pert biologists. In the set of computationally 
predicted genes, we expect both false-positive 
predictions (some of these may in fact be inac- 
tive pseudogenes) and false-negative predic- 
tions (some human genes will not be computa- 
tionally predicted). We also expect errors in 
delimiting the boundaries of exons and genes. 
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of 
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins cate- 
gorized with current classification meth- 
ods? (ii) What are the core functions that 
appear to be common across the animals? 
(iii) How does the human protein comple- 
ment differ from that of other sequenced 
eukaryotes? 



7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the puta- 
tive molecular functions of the predicted 
26,588 human proteins that have at least 
two lines of supporting evidence. About 
41% (12,809) of the gene products could 
not be classified from this initial analysis 
and are termed proteins with unknown 
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of 
"unclassified" sequences that do, in fact, 
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular func- 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins 
as possible. These functional predictions 
are based on similarity to sequences of 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third of these 636 predicted 
genes represented endogenous retroviral pro- 
teins, further suggesting that the majority of 
these unknown-function genes are not real 
genes. Given that most of these additional 
12,095 genes appear to be unique among the 
genomes sequenced to date, many may simply 
represent false-positive gene predictions. 

The most common molecular functions are 
the transcription factors and those involved in 
nucleic acid metabolism (nucleic acid enzyme). 
Other functions that are highly represented in 
the human genome are the receptors, kinases, 
and hydrolases. Not surprisingly, most of the 
hydrolases are proteases. There are also many 
proteins that are members of proto-oncogene 
families, as well as families of "select regula- 
tory molecules": (i) proteins involved in specif- 
ic steps of signal transduction such as hetero- 
trimeric GTP-binding proteins (G proteins) and 
cell cycle regulators, and (ii) proteins that mod- 
ulate the activity of kinases, G proteins, and 
phosphatases. 

Table 17. Distribution of SNPs in classes of 
genomic regions. 
Fig. 15. Distribution of the molecular functions of 26,383 human genes. Each slice lists the 
numbers and percentages (in parentheses) of human gene functions assigned to a given category of 
molecular function. The outer circle shows the assignment to molecular function categories in the 
Gene Ontology (GO) (179), and the inner circle shows the assignment to Celera's Panther molecular 
function categories (116). 

7.2 Evolutionary conservation of core 
processes 

Because of the various "model organism" 
genome-sequencing projects that have al- 
ready been completed, reasonable compara- 
tive information is available for beginning the 
analysis of the evolution of the human ge- 
nome. The genomes of S. cerevisiae ("bak- 
ers' yeast") (118) and two diverse inverte- 
brates, C. elegans (a nematode worm) (119) 
and D. melanogaster (fly) (25), as well as the 
first plant genome, A. thaliana, recently com- 
pleted (92), provide a diverse background for 
genome comparisons. 

We enumerated the "strict orthologs" con- 
served between human and fly, and between 
human and worm (Fig. 16) to address the 
question, What are the core functions that 
appear to be common across the animals? 
The concept of orthology is important be- 
cause if two genes are orthologs, they can be 
traced by descent to the common ancestor of 
the two organisms (an "evolutionarily con- 
served protein set"), and therefore are likely 
to perform similar conserved functions in the 
different organisms. It is critical in this anal- 
ysis to separate orthologs (a gene that appears 
in two organisms by descent from a common 
ancestor) from paralogs (a gene that appears 
in more than one copy in a given organism by 
a duplication event) because paralogs may 
subsequently diverge in function. Following 
the yeast-worm ortholog comparison in 
(120), we identified two different cases for 
each pairwise comparison (human-fly and 
human-worm). The first case was a pair of 
genes, one from each organism, for which 
there was no other close homolog in either 
organism. These are straightforwardly identi- 
fied as orthologous, because there are no 
additional members of the families that com- 
plicate separating orthologs from paralogs. 
The second case is a family of genes with 
more than one member in either or both of the 
organisms being compared. Chervitz et al. 
(120) dealt with this case by analyzing a 
phylogenetic tree that described the relation- 
ships between all of the sequences in both 
organisms, and then looking for pairs of genes 
that were nearest neighbors in the tree. If the 
nearest-neighbor pairs were from different 
organisms, those genes were presumed to be 
orthologs. We note that these nearest neigh- 
bors can often be confidently identified from 
pairwise sequence comparison without hav- 
ing to examine a phylogenetic tree (see leg- 
end to Fig. 16). If the nearest neighbors are 
not from different organisms, there has been 
a paralogous expansion in one or both organ- 
isms after the speciation event (and/or a gene 
loss by one organism). When this one-to-one 
correspondence is lost, defining an ortholog 
becomes ambiguous. For our initial compu- 
tational overview of the predicted human pro- 
tein set, we could not answer this question for 
every predicted protein. Therefore, we con- 
sider only "strict orthologs," i.e., the proteins 
with unambiguous one-to-one relationships 
(Fig. 16). By these criteria, there are 2758 
strict human-fly orthologs and 2031 human- 
worm orthologs (1523 in common between 
these sets). We define the evolutionarily 
conserved set as those 1523 human proteins 
that have strict orthologs in both 
D. melanogaster and C. elegans. 
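
The "strict ortholog" idea (reciprocal best hits that also beat every within-genome paralog) can be sketched in Python as below. This is an illustrative simplification under stated assumptions: the input is a list of hypothetical (query, subject, score) tuples, lower scores mean more significant matches, and an E-value-style threshold of 1e-10 stands in for the BLASTP P-value criterion in the legend to Fig. 16. None of the function names come from the source.

```python
import math


def best_hits(hits, max_evalue=1e-10):
    """Map each query to its most significant (subject, evalue) hit."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # not significant enough to be considered
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def best_paralog_evalue(self_hits):
    """Most significant within-genome match for each gene (self hits ignored)."""
    best = {}
    for query, subject, evalue in self_hits:
        if query == subject:
            continue
        for gene in (query, subject):
            best[gene] = min(best.get(gene, math.inf), evalue)
    return best


def strict_orthologs(a_vs_b, b_vs_a, a_vs_a=(), b_vs_b=(), max_evalue=1e-10):
    """Reciprocal best hits that are more significant than any paralog match."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    paralog_a = best_paralog_evalue(a_vs_a)
    paralog_b = best_paralog_evalue(b_vs_b)
    pairs = []
    for gene_a, (gene_b, evalue_ab) in best_ab.items():
        if gene_b not in best_ba or best_ba[gene_b][0] != gene_a:
            continue  # not bi-directional
        weakest = max(evalue_ab, best_ba[gene_b][1])
        if weakest < paralog_a.get(gene_a, math.inf) and \
                weakest < paralog_b.get(gene_b, math.inf):
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


def conserved_set(human_fly_pairs, human_worm_pairs):
    """Human genes that have a strict ortholog in both other genomes."""
    return {h for h, _ in human_fly_pairs} & {h for h, _ in human_worm_pairs}
```

Applying `conserved_set` to the human-fly and human-worm pair lists mirrors the way the conserved set is defined in the text as the genes common to both sets of strict orthologs.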

The distribution of the functions of the 
conserved protein set is shown in Fig. 16. 
Comparison with Fig. 15 shows that, not 
surprisingly, the set of conserved proteins is 
not distributed among molecular functions in 
the same way as the whole human protein set. 
Compared with the whole human set (Fig. 
15), there are several categories that are over- 
represented in the conserved set by a factor of 
~2 or more. The first category is nucleic acid 
enzymes, primarily the transcriptional ma- 
chinery (notably DNA/RNA methyltrans- 
ferases, DNA/RNA polymerases, helicases, 
DNA ligases, DNA- and RNA-processing 
factors, nucleases, and ribosomal proteins). 
The basic transcriptional and translational 
machinery is well known to have been con- 
served over evolution, from bacteria through 
to the most complex eukaryotes. Many ribo- 
nucleoproteins involved in RNA splicing also 
appear to be conserved among the animals. 
Other enzyme types are also overrepresent- 
ed (transferases, oxidoreductases, ligases, 
lyases, and isomerases). Many of these en- 
zymes are involved in intermediary metabo- 
lism. The only exception is the hydrolase 
category, which is not significantly overrep- 
resented in the shared protein set. Proteases 
form the largest part of this category, and 
several large protease families have expanded 
in each of these three organisms after their 
divergence. The category of select regulatory 
molecules is also overrepresented in the con- 
served set. The major conserved families are 
small guanosine triphosphatases (GTPases) 
(especially the Ras-related superfamily, in- 
cluding ADP ribosylation factor) and cell 
cycle regulators (particularly the cullin fam- 
ily, cyclin C family, and several cell division 
protein kinases). The last two significantly 
overrepresented categories are protein trans- 
port and trafficking, and chaperones. The 
most conserved groups in these categories are 
proteins involved in coated vesicle-mediated 
transport, and chaperones involved in protein 
folding and heat-shock response [particularly 
the DNAJ family, and heat-shock protein 
60 (HSP60), HSP70, and HSP90 families]. 
These observations provide only a conserva- 
tive estimate of the protein families in the 
context of specific cellular processes that 
were likely derived from the last common 
ancestor of the human, fly, and worm. As 
stated before, this analysis does not provide a 
complete estimate of conservation across the 
three animal genomes, as paralogous dupli- 
cation makes the determination of true or- 
thologs difficult within the members of con- 
served protein families. 

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate genomes. Each slice 
lists the number and percentages (in parentheses) of "strict orthologs" between the human, fly, 
and worm genomes involved in a given category of molecular function. "Strict orthologs" are 
defined here as bi-directional BLAST best hits (180) such that each orthologous pair (i) has a 
BLASTP P-value of ≤10^-10 (120), and (ii) has a more significant BLASTP score than any paralogs 
in either organism, i.e., there has likely been no duplication subsequent to speciation that 
might make the orthology ambiguous. This measure is quite strict and is a lower bound on the 
number of orthologs. By these criteria, there are 2758 strict human-fly orthologs, and 2031 
human-worm orthologs (1523 in common between these sets). 

nucleic acid enzyme (221, 12.9%) 
select regulatory molecule (88, 5.1%) 
transferase (70, 4.1%) 
cytoskeletal structural protein (20, 1.2%) 
chaperone (16, 0.9%) 
cell adhesion (11, 0.6%) 
miscellaneous (72, 4.2%) 
viral protein (4, 0.2%) 
transfer/carrier protein (11, 0.6%) 
transcription factor (81, 4.7%) 
extracellular matrix (12, 0.7%) 
ion channel (7, 0.4%) 
motor (13, 0.8%) 
structural protein of muscle (8, 0.5%) 
protooncogene (23, 1.3%) 
intracellular transporter (51, 3.0%) 
transporter (44, 2.6%) 
receptor (23, 1.3%) 
kinase (69, 4.0%) 
synthase and synthetase (64, 3.7%) 
oxidoreductase (64, 3.7%) 
lyase (12, 0.7%) 
ligase (9, 0.3%) 
molecular function unknown (613, 35.8%) 
hydrolase (80, 4.7%) 
isomerase (21, 1.2%) 

7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 

To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and 
19 display a comparison among all sequenced 
eukaryotic genomes, over selected protein/ 
domain families (defined by sequence simi- 
larity, e.g., the serine-threonine protein ki- 
nases) and superfamilies (defined by shared 
molecular function, which may include sev- 
eral sequence-related families, e.g., the cyto- 
kines). In these tables we have focused on 
(super) families that are either very large or 
that differ significantly in humans compared 
with the other sequenced eukaryote genomes. 
We have found that the most prominent hu- 
man expansions are in proteins involved in (i) 
acquired immune functions; (ii) neural devel- 
opment, structure, and functions; (iii) inter- 
cellular and intracellular signaling pathways 
in development and homeostasis; (iv) hemo- 
stasis; and (v) apoptosis. 

Acquired immunity. One of the most 
striking differences between the human ge- 
nome and the Drosophila or C. elegans ge- 
nome is the appearance of genes involved in 
acquired immunity (Tables 18 and 19). This 
is expected, because the acquired immune 
response is a defense system that only occurs 
in vertebrates. We observe 22 class I and 22 
class II major histocompatibility complex 
(MHC) antigen genes and 114 other immu- 
noglobulin genes in the human genome. In 
addition, there are 59 genes in the cognate 
immunoglobulin receptor family. At the do- 
main level, this is exemplified by an expan- 
sion and recruitment of the ancient immuno- 
globulin fold to constitute molecules such as 
MHC, and of the integrin fold to form several 
of the cell adhesion molecules that mediate 
interactions between immune effector cells 
and the extracellular matrix. Vertebrate-spe- 
cific proteins include the paracrine immune 
regulators family of secreted 4-alpha helical 
bundle proteins, namely the cytokines and 
chemokines. Some of the cytoplasmic signal 
transduction components associated with cy- 
tokine receptor signal transduction are also 
features that are poorly represented in the fly 
and worm. These include protein domains 
found in the signal transducer and activator of 
transcription (STATs), the suppressors of cy- 
tokine signaling (SOCS), and protein inhibi- 
tors of activated STATs (PIAS). In contrast, 
many of the animal-specific protein domains 
that play a role in innate immune response, 
such as the Toll receptors, do not appear to be 
significantly expanded in the human genome. 

Neural development, structure, and 
function. In the human genome, as compared 
with the worm and fly genomes, there is a 
marked increase in the number of members 
of protein families that are involved in 
neural development. Examples include neu- 
rotrophic factors such as ependymin, nerve 
growth factor, and signaling molecules 
such as semaphorins, as well as the number 
of proteins involved directly in neural 
structure and function such as myelin pro- 
teins, voltage-gated ion channels, and syn- 
aptic proteins such as synaptotagmin. 
These observations correlate well with the 
known phenotypic differences between the 
nervous systems of these taxa, notably (i) 
the increase in the number and connectivity 
of neurons; (ii) the increase in number of 
distinct neural cell types (as many as a 
thousand or more in human compared with 
a few hundred in fly and worm) (121); (iii) 
the increased length of individual axons; 
and (iv) the significant increase in glial cell 
number, especially the appearance of my- 
elinating glial cells, which are electrically 
inert supporting cells differentiated from 
the same stem cells as neurons. A number 
of prominent protein expansions are in- 
volved in the processes of neural develop- 
ment. Of the extracellular domains that me- 
diate cell adhesion, the connexin domain- 
containing proteins (122) exist only in hu- 
mans. These proteins, which are not present 
in the Drosophila or C. elegans genomes, 
appear to provide the constitutive subunits 
of intercellular channels and the structural 
basis for electrical coupling. Pathway find- 
ing by axons and neuronal network forma- 
tion is mediated through a subset of ephrins 
and their cognate receptor tyrosine kinases 
that act as positional labels to establish 
topographical projections (123). The prob- 
able biological role for the semaphorins (22 
in human compared with 6 in the fly and 2 
in the worm) and their receptors (neuropi- 
lins and plexins) is that of axonal guidance 
molecules (124). Signaling molecules such 
as neurotrophic factors and some cytokines 
have been shown to regulate neuronal cell 
survival, proliferation, and axon guidance 
(125). Notch receptors and ligands play 
important roles in glial cell fate determina- 
tion and gliogenesis (126). 

Other human expanded gene families play 
key roles directly in neural structure and 
function. One example is synaptotagmin (ex- 
panded more than twofold in humans relative 
to the invertebrates), originally found to reg- 
ulate synaptic transmission by serving as a 
Ca2+ sensor (or receptor) during synaptic 
vesicle fusion and release (127). Of interest is 
the increased co-occurrence in humans of 
PDZ and the SH3 domains in neuronal- 
specific adaptor molecules; examples include 
proteins that likely modulate channel activity 
at synaptic junctions (128). We also noted 
expansions in several ion-channel families 
(Table 19), including the EAG subfamily 
(related to cyclic nucleotide gated channels), 
the voltage-gated calcium/sodium channel 
family, the inward-rectifier potassium chan- 
nel family, and the voltage-gated potassium 
channel, alpha subunit family. Voltage-gated 
sodium and potassium channels are involved 
in the generation of action potentials in neu- 
rons. Together with voltage-gated calcium 
channels, they also play a key role in cou- 
pling action potentials to neurotransmitter re- 
lease, in the development of neurites, and in 
short-term memory. The recent observation 
of a calcium-regulated association between 
sodium channels and synaptotagmin may 
have consequences for the establishment and 
regulation of neuronal excitability (129). 

Myelin basic protein and myelin-associat- 
ed glycoprotein are major classes of protein 
components in both the central and peripheral 
nervous system of vertebrates. Myelin P0 is a 
major component of peripheral myelin, and 
myelin proteolipid and myelin oligodendro- 
cyte glycoprotein are found in the central 
nervous system. Mutations in any of these 
Table 18. Domain-based comparative analysis of proteins in H. sapiens (H), 
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The 
predicted protein set of each of the above eukaryotic organisms was analyzed 
with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins 
containing the specified Pfam domains as well as the total number of domains 
(in parentheses) are shown in each column. Domains were categorized into 
cellular processes for presentation. Some domains (i.e., SH2) are listed in 
more than one cellular process. Results of the Pfam analysis may differ from 
results obtained based on human curation of protein families owing to the 
limitations of large-scale automatic classifications. Representative examples 
of domains with reduced counts owing to the stringent E value cutoff used for 
this analysis are marked with a double asterisk (**). Examples include short 
divergent and predominantly alpha-helical domains, and certain classes of 
cysteine-rich zinc finger proteins. 




Accession Domain name      Domain description 

Developmental and homeostatic regulators 

PF02039   Adrenomedullin   Adrenomedullin 
PF00212   ANP              Atrial natriuretic peptide 
PF00028   Cadherin         Cadherin domain 
PF00214   Calc_CGRP_IAPP   Calcitonin/CGRP/IAPP family 
PF01110   CNTF             Ciliary neurotrophic factor 
PF01093   Clusterin        Clusterin 
PF00029   Connexin         Connexin 
PF00976   ACTH_domain      Corticotropin ACTH domain 
PF00473   CRF              Corticotropin-releasing factor family 
PF00007   Cys_knot         Cystine-knot domain 
PF00778   DIX              Dix domain 
PF00322   Endothelin       Endothelin family 
PF00812   Ephrin           Ephrin 
PF01404   EPH_lbd          Ephrin receptor ligand binding domain 
PF00167   FGF              Fibroblast growth factor 
PF01534   Frizzled         Frizzled/Smoothened family membrane region 
PF00236   Hormone6         Glycoprotein hormones 
PF01153   Glypican         Glypican 
PF01271   Granin           Granin (chromogranin or secretogranin) 
PF02058   Guanylin         Guanylin precursor 
PF00049   Insulin          Insulin/IGF/Relaxin family 
PF00219   IGFBP            Insulin-like growth factor binding proteins 
PF02024   Leptin           Leptin 
PF00193   Xlink            LINK (hyaluron. binding) 
PF00243   NGF              Nerve growth factor family 
PF02158   Neuregulin       Neuregulin family 
PF00184   Hormone5         Neurohypophysial hormones 
PF02070   NMU              Neuromedin U 
PF00066   Notch            Notch (DSL) domain 
PF00865   Osteopontin      Osteopontin 
PF00159   Hormone3         Pancreatic hormone peptides 
PF01279   Parathyroid      Parathyroid hormone family 
PF00123   Hormone2         Peptide hormone 
PF00341   PDGF             Platelet-derived growth factor (PDGF) 
PF01403   Sema             Sema domain 
PF01033   Somatomedin_B    Somatomedin B domain 
PF00103   Hormone          Somatotropin 
PF02208   Sorb             Sorbin homologous domain 
PF02404   SCF              Stem cell factor 
PF01034   Syndecan         Syndecan domain 
PF00020   TNFR_c6          TNFR/NGFR cysteine-rich region 
PF00019   TGF-β            Transforming growth factor β-like domain 
PF01099   Uteroglobin      Uteroglobin family 
PF01160   Opiods_neuropep  Vertebrate endogenous opioids neuropeptide 
PF00110   Wnt              Wnt family of developmental signaling proteins 

Hemostasis 

PF01821   ANATO            Anaphylotoxin-like domain 
PF00386   C1q              C1q domain 
PF00200   Disintegrin      Disintegrin 
PF00754   F5_F8_type_C     F5/8 type C domain 
PF01410   COLFI            Fibrillar collagen C-terminal domain 
PF00039   Fn1              Fibronectin type I domain 
PF00040   Fn2              Fibronectin type II domain 
PF00051   Kringle          Kringle domain 
PF01823   MACPF            MAC/Perforin domain 
PF00354   Pentaxin         Pentaxin family 
PF00277   SAA_proteins     Serum amyloid A protein 
PF00084   Sushi            Sushi domain (SCR repeat) 
PF02210   TSPN             Thrombospondin N-terminal-like domains 
PF01108   Tissue_fac       Tissue factor 
PF00868   Transglutamin_N  Transglutaminase family 
PF00927   Transglutamin_C  Transglutaminase family 

1 
2 

100(550) 
3 
1 
3 

• • 14(16) 
1 
2 

10(11) 
5 

7(8) 
12 
23 
9 
1 

14 
3 
1 
7 

10 

13(23) 
3 
4 
1 
1 

3(5) 
1 
3 

5(9) 
5 

27(29) 
5(8) 
1 
2 
2 
3 

17(31) 
27(28) 

3 

3 
18 

6(14) 
24 
18 
15(20) 
10 

5(18) : 

11(16) 
15(24) 

6 

9 

4 

53(191) 
14 
1 
6 
8 



0 
0 

14(157) 
0 
0 
0 
0 
0 
1 
2 
2 
0 
2 
2 
1 
7 
0 
2 
0 
0 
4 
0 
0 
0 
0 
0 
0 
0 

2(4) 
0 
0 
0 
0 

1 

8(10) 
3 
0 
0 
0 
1 
1 
6 
0 
0 

700) 

0 
0 
2 

5(6) 

. . 0 
0 

b 

2 
0 
0 
0 

11(42) 
1 
0 
1 
1 



0 
0 

16(66) 
0 
0 
0 
0 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 

1 

0 
0 
0 
0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 



O 
0 
3 
2 
0 
0 
0 
2 
0 
O 
0 

8(45) 
O 
0 
0 
0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0' 
0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

b 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
Accession Domain name      Domain description 

PF00594   Gla              Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain 

Immune response 

PF00711   Defensin_beta    Beta defensin 
PF00748   Calpain_inhib    Calpain inhibitor repeat 
PF00666   Cathelicidins    Cathelicidins 
PF00129   MHC_I            Class I histocompatibility antigen, domains alpha 1 and 2 
PF00993   MHC_II_alpha**   Class II histocompatibility antigen, alpha domain 
PF00969   MHC_II_beta**    Class II histocompatibility antigen, beta domain 
PF00879   Defensin_propep  Defensin propeptide 
PF01109   GM_CSF           Granulocyte-macrophage colony-stimulating factor 
PF00047   ig               Immunoglobulin domain 
PF00143   Interferon       Interferon alpha/beta domain 
PF00714   IFN-gamma        Interferon gamma 
PF00726   IL10             Interleukin-10 
PF02372   IL15             Interleukin-15 
PF00715   IL2              Interleukin-2 
PF00727   IL4              Interleukin-4 
PF02025   IL5              Interleukin-5 
PF01415   IL7              Interleukin-7/9 family 
PF00340   IL1              Interleukin-1 
PF02394   IL1_propep       Interleukin-1 propeptide 
PF02059   IL3              Interleukin-3 
PF00489   IL6              Interleukin-6/G-CSF/MGF family 
PF01291   LIF_OSM          Leukemia inhibitory factor (LIF)/oncostatin (OSM) family 
PF00323   Defensins        Mammalian defensin 
PF01091   PTN_MK           PTN/MK heparin-binding protein 
PF00277   SAA_proteins     Serum amyloid A protein 
PF00048   IL8              Small cytokines (intecrine/chemokine), interleukin-8 like 
PF01582   TIR              TIR domain 
PF00229   TNF              TNF (tumor necrosis factor) family 
PF00088   Trefoil          Trefoil (P-type) domain 

PI-PY-rho GTPase signaling 

PF00779   BTK              BTK motif 
PF00168   C2               C2 domain 
PF00609   DAGKa            Diacylglycerol kinase accessory domain (presumed) 
PF00781   DAGKc            Diacylglycerol kinase catalytic domain (presumed) 
PF00610   DEP              Domain found in Dishevelled, Egl-10 and Pleckstrin (DEP) 
PF01363   FYVE             FYVE zinc finger 
PF00996   GDI              GDP dissociation inhibitor 
PF00503   G-alpha          G-protein alpha subunit 
PF00631   G-gamma          G-protein gamma like domains 
PF00616   RasGAP           GTPase-activator protein for Ras-like GTPase 
PF00618   RasGEFN          Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif 
PF00625   Guanylate_kin    Guanylate kinase 
PF02189   ITAM             Immunoreceptor tyrosine-based activation motif 
PF00169   PH               PH domain 
PF00130   DAG_PE-bind      Phorbol esters/diacylglycerol binding domain (C1 domain) 
PF00388   PI-PLC-X         Phosphatidylinositol-specific phospholipase C, X domain 
PF00387   PI-PLC-Y         Phosphatidylinositol-specific phospholipase C, Y domain 
PF00640   PID              Phosphotyrosine interaction domain (PTB/PID) 
PF02192   PI3K_p85B        PI3-kinase family, p85-binding domain 
PF00794   PI3K_rbd         PI3-kinase family, ras-binding domain 
PF01412   ArfGAP           Putative GTP-ase activating protein for Arf 
PF02196   RBD              Raf-like Ras-binding domain 
PF02145   Rap_GAP          Rap/ran-GAP 
PF00788   RA               Ras association (RalGDS/AF-6) domain 
PF00071   Ras              Ras family 
PF00617   RasGEF           RasGEF domain 
PF00615   RGS              Regulator of G protein signaling domain 
PF02197   RIIa             Regulatory subunit of type II PKA R-subunit 



11 



5 

73 (101) 
9 
10 
12(13) 

28 (30) 
6 

27(30) 
16 
11 
9 

12 
3 

193(212) 
45(56) 

12 

11 

24(27) 
2 
6 
16 
6(7) 
5 

18(19) 
126 
21 
27 
4 



1 


0 


3(9) 


o 


2 


0 


18(20) 


o 


5(6) 


0 


7 


0 


3 


0 


1 


0 




125(291) 


7(9) 


0 


1 


0 


1 


0 


1 


0 


1 


0 


1 


0 


1 


0 


1 


0 


7 


0 


1 


0 


1 


0 


2 


0 


2 


0 


2 


0 


2 


0 


4 


0 


32 


0 


18 


8 


12 


0 


5(6) 


0 



1 

32 (44) 
4 
8 
4 

14 
2 

10 
5 
5 
2 

8 
0 

72(78) 
25(31) 



13 
1 
3 
9 
4 
4 

7(9) 
56(57) 
8 

6(7) 
1 



0 
0 
0 

. 0 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 



11(12) 

1 
1 

8 

1 

2 
6 
51 
7 

12(13) 
2 



0 
0 
0 
6 
0 
0 

1 

23 
5 
1 
1 



0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


2 


0 


131 (143) 


0 


0 


. 0 


2 


0 


0 


0 


0 


0 


24(35) 


6(9) 


66(90) 


7 


0 


6 


8 


2 


11(12) 


10 


5 


2 


15 


5 


15 


1 


1 


3 


20(23) 


2 


5 


5 


1 


0 


8 


3 


0 


3 


5 


0 


7 


1 


4 


0 


0 


0 


65 (68) 


24 


23 


26 (40) 


1(2) , 


4 


7 


1 


8 


7 


1 


8 



0 
0 
0 
15 
0 
0 
0 
78 
0 
0 
0 
Table 18 (Continued) 



Accession 
number 

PF00620 
PF00621 
PF00536 
PF01369 
PF00017 
PF00018 
PF01017 
PF00790 
PF00568 

PF00452
PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 

PF00022 
PF00191 
PF00402 
PF00373 
PF00880 
PF00681 
PF00435
PF00418 
PF00992 
PF02209 
PF01044 

PF01391 
PF01413 

PF00431 
PF00008 
PF00147 



Domain name 



Domain description 



PF00041 

PF00757 

PF00357 

PF00362 

PF00052 

PF00053 

PF00054 

PF00055 

PF00059 

PF01463 

PF01462 

PF00057 

PF00058 

PF00530 

PF00084 

PF00090 

PF00092 

PF00093 

PF00094 

PF00244 

PF00023 

PF00514 

PF00168 

PF00027 

PF01556 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



RhoGAP 
RhoGEF 
SAM 
Sec7 
SH2 
SH3 
STAT 
VHS 
WH1 

Bcl-2

BH4 

CARD 

Death 

DED 

BAG 

ICE_p20 

BIR 

Actin 

Annexin 

Calponin

Band_41

Nebulin_repeat 

Plectin_repeat 

Spectrin 

Tubulin-binding 

Troponin 

VHP 

Vinculin 



Collagen 
C4 

CUB 
EGF 

Fibrinogen_C 



Fn3 

Furin-like 

Integrin_A

Integrin_B

Laminin_B 

Laminin_EGF 

Laminin_G 

Laminin_Nterm 

Lectin_c 

LRRCT 

LRRNT 

Ldl_recept_a

Ldl_recept_b 

SRCR 

Sushi 

Tsp_1 

Vwa 

Vwc 

Vwd 

14-3-3 
Ank 

Armadillo_seg 
C2 

cNMP_binding
DnaJ_C 
DnaJ 
Efhand** 
FCH 
FF 
FHA 



RhoGAP domain 
RhoGEF domain 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WH1 domain 

Domains involved in apoptosis 

Bcl-2

Bcl-2 homology region 4
Caspase recruitment domain 
Death domain 
Death effector domain 
Domain present in Hsp70 regulators 
ICE-like protease (caspase) p20 domain 
Inhibitor of Apoptosis domain 

Cytoskeletal

Actin 

Annexin 

Calponin family 

FERM domain (Band 4.1 family) 
Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

ECM adhesion
Collagen triple helix repeat (20 copies)
C-terminal tandem repeated domain in type 4 procollagen
CUB domain 
EGF-like domain 

Fibrinogen beta and gamma chains, C-terminal 
globular domain 

Fibronectin type III domain 

Furin-like cysteine rich region 

Integrin alpha cytoplasmic region 

Integrins, beta chain 

Laminin B (Domain IV) 

Laminin EGF-like (Domains III and V) 
Laminin G domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D domain 

Protein interaction domains

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain

Cyclic nucleotide-binding domain
DnaJ C terminal region
DnaJ domain
EF hand

Fes/CIP4 homology domain 
FF domain 
FHA domain 



H 

59 
46 
29(31) 
13 

87(95) 
143(182) 
7 
4 
7 

9 
3 
16 
16 
4(5) 
5(8) 
11 
8(14) 

61 (64) 
16(55) 
13(22) 
29(30) 
4(148) 
2(11) 
31 (195) 
4(12) 
4 
5 
4 

65(279) 
6(11) 



47(69) 
108 (420) 
26 

106(545) 
5 
3 
8 

8(12) 
24(126) 
30(57) 
10 

47(76) 
69 (81) 
40(44) 
35(127) 
15(96) 
11(46) 
53(191) 
41 (66) 
34 (58) 
19(28) 
15(35) 



20 

145(404) 
22(56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



19 
23(24) 
15 
5 

33(39) 
55(75) 
1 
2 
2 

2 
0 
0 
5 
0 
3 
7 

5(9) 

15(16) 
4(16) 
3 

17(19) 

1(2) 
0 

13(171) 
1(4) 
6 
2 
2 

10(46) 
2(4) 



9(47) 
45 (186) 
10(11) 

42(168) 
2 
1 
2 

4(7) 
9(62) 
18(42) 
6 

23(24) 
23(30) 
7(13) 
33(152) 
9(56) 
4(8) 
11(42) 
11(23) 
0 

6(11) 
3(7) 

3
72(269) 
11(38) 
32(44) 
21 (33) 
9 
34 

64(117) 
3 

4(10) 
15 



W 

20 
18(19) 
8 
5 

44(48) 
46(61) 

1(2) 
4 

2(3)

1 
1 
2 
7 
0 
2 
3 

2(3) 

12 
4(11)
7(19) 
11(14) 
1 
0 

10(93) 
2(8) 
8 
2 
1 



174(384) 
3(6) 

43(67) 
54(157) 
6 

34(156) 
1 
2 
2 

6(10) 
11(65) 
14(26) 
4 

91 (132) 
7(9) 
3(6) 
27(113) 
7(22) 
1(2) 
8(45) 
18(47) 
17(19) 
2(5) 
9 

3 

75(223) 
3(11)
24(35) 
15(20) 
5 
33 
41 (86) 
2 

3(16) 
7 



9 
3 
3 
5 
1 

23(27) 
0 
4 
1 

0 
0
0 
0 
0 

1 

0 

1(2) 

9(11)
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 

0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
0 



2 

12(20) 
2(10) 
6(9) 
2(3) 
3 
20 
4(11) 
4 

2(5) 
13(14) 



8 
0 
6 
9 
3 
4 
0 
8 
0 

0 
0 
0 
0 
0 
5 
0 
0 



24 
6(16) 
0 
0 
0 
0 
0 
0 
0 
5 
0 

0 
0 

0 

1 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 
0 



15 

66(111) 
25(67) 
66(90) 
22 
19 
93 

120 (328) 
0 

4(8) 
17 



0 
myelin proteins result in severe demyelination,
which is a pathological condition in which the
myelin is lost and the nerve conduction is
severely impaired (130). Humans have at least 10
genes belonging to four different families
involved in myelin production (five myelin P0,
three myelin proteolipid, myelin basic protein,
and myelin-oligodendrocyte glycoprotein, or MOG),
and possibly more-remotely related members of the
MOG family. Flies have only a single myelin
proteolipid, and worms have none at all.

Intercellular and intracellular signaling
pathways in development and homeostasis.
Many protein families that have expanded in
humans relative to the invertebrates are
involved in signaling processes, particularly in
response to development and differentiation

Table 18 (Continued)


Accession 
number



Domain name 



Domain description 



H 



F 



PF00254 


FKBP 


PF01590


GAF


PF01344 


Kelch 


PF00560


LRR** 


PF00917 


MATH 


PF00989 


PAS 


PF00595


PDZ


PF00169 


PH 


PF01535 


PPR** 


PF00536


SAM 


PF01369 


Sec7 


PF00017 


SH2 


PF00018


SH3 


PF01740 


STAS 


PF00515 


TPR** 


PF00400 


WD40** 


PF00397 


WW 


PF00569


ZZ 


PF01754 


Zf-A20 


PF01388 


ARID 


PF01426 


BAH 


PF00543 


Zf-B_box** 


PF00533 


BRCT 


PF00439 


Bromodomain 


PF00651 


BTB 


PF00145 


DNA_methylase 


PF00385 


Chromo 


PF00125 


Histone 


PF00134 


Cyclin 


PF00270 


DEAD 


PF01529


Zf-DHHC 


PF00646 


F-box** 


PF00250 


Fork_head


PF00320


GATA 


PF01585 


G-patch


PF00010 


HLH** 


PF00850 




PF00046 


Homeobox 


PF01833 


TIG


PF02373 


JmjC 


PF02375 


JmjN 


PF00013 


KH-domain 


PF01352 


KRAB 


PF00104 


Hormone_rec 


PF00412 


LIM 


PF00917 


MATH 


PF00249 


Myb_DNA-binding 


PF02344 


Myc-LZ 


PF01753 


Zf-MYND 


PF00628 


PHD 


PF00157 


Pou 


PF02257 


RFX_DNA_binding


PF00076 


Rrm 


PF02037 


SAP 


PF00622 


SPRY 


PF01852


START 


PF00907 


T-box 



w 



FKBP-type peptidyl-prolyl cis-trans isomerases

GAF domain

Kelch motif 

Leucine Rich Repeat 

MATH domain 

PAS domain 

PDZ domain (also known as DHR or GLGF)
PH domain 
PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ - Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains 

A20-like zinc finger 
ARID DNA binding domain 
BAH domain 
B-box zinc finger 

BRCA1 C Terminus (BRCT) domain 

Bromodomain 

BTB/POZ domain 

C-5 cytosine-specific DNA methylase
'chromo' (CHRromatin Organization MOdifier) domain

Core histone H2A/H2B/H3/H4 
Cyclin 

DEAD/DEAH box helicase 
DHHC zinc finger domain 
F-box domain 
Fork head domain 
GATA zinc finger
G-patch domain

Helix-loop-helix DNA-binding domain 
Histone deacetylase family 
Homeobox domain 
IPT/TIG domain
JmjC domain 
JmjN domain 
KH domain 
KRAB box 

Ligand-binding domain of nuclear hormone receptor
LIM domain containing proteins 
MATH domain 

Myb-like DNA-binding domain 
Myc leucine zipper domain
MYND finger
PHD-finger

Pou domain— N-terminal to homeobox domain 
RFX DNA-binding domain 
RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
SAP domain 
SPRY domain 
START domain 
T-box 



15(20) 
7(8) 
54(157) 
25(30) 
11 

18(19) 

96 (154) 
193(212) 

5 

29(31) 
13 

87(95) 
143(182) 
5 

72(131) 
136(305) 
32(53) 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28) 
37(48) 

97 (98) 
3(4) 

24(27) 



7(8) 
2(4) 
12(48) 
24(30) 
5 

9(10) 
60(87) 
72(78) 
3(4) 
15 
5 

33(39) 
55(75) 
1 

39(101) 
98(226) 
24(39) 
13 

2 
6 

7(8) 
1 

10(18) 
16(22) 
62 (64) 
1 

14(15) 



75(81) 


5 


19 


10 


63 (66) 


48(50) 


15 


20 


16 


15 


35(36) 


20(21) 


11(17) 


5(6) 


18 


16 


60(61) 


44 


12 


5(6) 


160(178) 


100(103) 


29(53) 


11(13) 


10 


4 


7 


4 


28(67) 


14(32) 


204(243) 


0 


47 


17 



62 (129) 
11 

32(43) 
1 
14 
68 (86) 
15 
7 

224(324) 
15 

44(51) 
10 
17(19) 



33(83) 
5 

18(24) 
0 
14 

40(53) 
5 
2 

127(199) 



7(13) 
1 

13(41) 
7(11) 
88(161) 
6 

46(66) 
65(68) 
0 
8 
5 

44(48) 
46(61) 
6 

28(54) 
72(153) 
16(24) 
10 

2 
4 

4(5) 
2 

23(35) 
18(26) 
86(91) 
0 

17(18) 

71 (73) 
10 

55(57) 
16 

309 (324) 
15 
8(10) 
13 
24 
8(10) 
82 (84) 
5(7) 
6 
2 

17(46) 
0 

142(147) 

33(79) 
88(161) 
17(24) 
0 
9 

32 (44) 
4 
1 

94(145) 



8 5 

10(12) 5(7) 

2 6 

8 22 



v 
Y 


A 


4 


24(29) 


0 


10 


3 


102(178) 


1 


15(16) 


1


61 (74) 


1 


13(18) 


2 


5 


24 


23 


1 


474(2485) 


3 


6 


5 


9 


1 


3 


23(27) 


4 


2 


13 


16(31) 
56(121) 


65(124) 
167(344) 


5(8) 


11(15) 


2 


10 



0 
2 
5 
0 

10(16) 
10(15) 
1(2)
0

1(2)

8 
11 

50(52) 
7 
9 
4 
9 
4 
4 
5 
6 
2 
4 
3 

4(14) 
0 
0 

4(7) 
1 

15(20) 
0 
1 

14(15) 
0 
1 

43(73) 

5 
3 
0 
0 



8 
7 

21 (25) 
0 

12(16) 
28 
30(31) 
13(15) 
12 

48 
35 
84(87) 
22 

165(167) 
0 
26 
14(15) 
39 
10 
66 
1 
7 
7 

27(61) 
0 
0 

10(16) 
61 (74) 
243 (401) 
0 
7 

96(105) 
0 
0 

232 (369) 

6(7) 
6 
23 
0 
Table 18 (Continued)






Accession 
number 



Domain name 



PF02135 
PF01285 
PF02176 
PF00352 

PF00567 
PF00642 
PF00096 
PF00097 
PF00098 



Zf-TAZ 
TEA 

Zf-TRAF 
TBP 

TUDOR 

Zf-CCCH 

Zf-C2H2** 

Zf-C3HC4 

Zf-CCHC 



Domain description 

TAZ finger 
TEA domain 
TRAF-type zinc finger 

Transcription factor TFIID (or TATA-binding protein, TBP)
TUDOR domain 

Zinc finger C-x8-C-x5-C-x3-H type (and similar) 

Zinc finger, C2H2 type 

Zinc finger, C3HC4 type (RING finger) 

Zinc knuckle 



2(3) 
4 

6(9) 
2(4) 

9(24) 
17(22) 
564(4500) 
135(137) 
9(17) 



1(2) 

1(3) 
4(8) 

9(19) 
6(8) 
234(771) 
57 
6(10) 



6(7) 
1 

2(4) 

4(5) 
22(42) 
68(155) 
88(89) 
17(33) 



0 
1 
0 

1(2) 
0 

3(5) 
34(56) 
18 
7(13) 



10(15) 
0 
2 

2(4) 
2 

31 (46) 
21 (24) 
298(304) 
68(91) 



(Tables 18 and 19). They include secreted
hormones and growth factors, receptors, in- 
tracellular signaling molecules, and transcrip- 
tion factors. 

Developmental signaling molecules that are 
enriched in the human genome include growth 
factors such as wnt, transforming growth factor-β
(TGF-β), fibroblast growth factor (FGF),
nerve growth factor, platelet derived growth 
factor (PDGF), and ephrins. These growth fac- 
tors affect tissue differentiation and a wide 
range of cellular processes involving actin-cy- 
toskeletal and nuclear regulation. The corre- 
sponding receptors of these developmental li- 
gands are also expanded in humans. For exam- 
ple, our analysis suggests at least 8 human 
ephrin genes (2 in the fly, 4 in the worm) and 12 
ephrin receptors (2 in the fly, 1 in the worm). In 
the wnt signaling pathway, we find 18 wnt 
family genes (6 in the fly, 5 in the worm) and 
12 frizzled receptors (6 in the fly, 5 in the 
worm). The Groucho family of transcriptional 
corepressors downstream in the wnt pathway 
are even more markedly expanded, with 13 
predicted members in humans (2 in the fly, 1 in
the worm). 
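
The expansions quoted in this paragraph can be restated as simple ratios. The following Python sketch only tabulates the counts given in the text above and divides the human count by the fly and worm counts; the names `family_counts` and `fold_expansion` are illustrative labels of our own and are not part of the original analysis.

```python
# Gene counts quoted in the paragraph above, as (human, fly, worm).
# The human ephrin count is stated as "at least 8", so its ratios are lower bounds.
# All names here are illustrative labels, not identifiers from the analysis.
family_counts = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled receptor": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return the human count divided by the count in another organism."""
    if other == 0:
        raise ValueError("ratio undefined when the comparison count is zero")
    return human / other


for family, (human, fly, worm) in family_counts.items():
    vs_fly = fold_expansion(human, fly)
    vs_worm = fold_expansion(human, worm)
    print(f"{family:18s} human={human:2d}  vs fly x{vs_fly:.1f}  vs worm x{vs_worm:.1f}")
```

On these numbers the Groucho family shows the largest ratio relative to both invertebrates, which is consistent with the statement that it is even more markedly expanded.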

Extracellular adhesion molecules involved 
in signaling are expanded in the human genome 
(Tables 18 and 19). The interactions of several 
of these adhesion domains with extracellular 
matrix proteoglycans play a critical role in host 
defense, morphogenesis, and tissue repair 
(131). Consistent with the well-defined role of
heparan sulfate proteoglycans in modulating 
these interactions (132), we observe an expansion
of the heparin sulfate sulfotransferases in
the human genome relative to worm and fly. 
These sulfotransferases modulate tissue differ- 
entiation (133). A similar expansion in humans 
is noted in structural proteins that constitute the 
actin-cytoskeletal architecture: Compared with 
the fly and worm, we observe an explosive 
expansion of the nebulin (35 domains per pro- 
tein on average), aggrecan (12 domains per 
protein on average), and plectin (5 domains per 
protein on average) repeats in humans. These 
repeats are present in proteins involved in modulating
the actin-cytoskeleton with predominant
expression in neuronal, muscle, and vascular 
tissues. 



Comparison across the five sequenced eu- 
karyotic organisms revealed several expand- 
ed protein families and domains involved in 
cytoplasmic signal transduction (Table 18). 
In particular, signal transduction pathways 
playing roles in developmental regulation and 
acquired immunity were substantially en- 
riched. There is a factor of 2 or greater ex- 
pansion in humans in the Ras superfamily 
GTPases and the GTPase activator and GTP 
exchange factors associated with them. Al- 
though there are about the same number of 
tyrosine kinases in the human and C. elegans 
genomes, in humans there is an increase in 
the SH2, PTB, and ITAM domains involved 
in phosphotyrosine signal transduction. Fur- 
ther, there is a twofold expansion of phos- 
phodiesterases in the human genome com- 
pared with either the worm or fly genomes. 

The downstream effectors of the intracellu- 
lar signaling molecules include the transcription 
factors that transduce developmental fates. Sig- 
nificant expansions are noted in the ligand- 
binding nuclear hormone receptor class of tran- 
scription factors compared with the fly genome, 
although not to the extent observed in the worm 
(Tables 18 and 19). Perhaps the most striking 
expansion in humans is in the C2H2 zinc finger 
transcription factors. Pfam detects a total of 
4500 C2H2 zinc finger domains in 564 human 
proteins, compared with 771 in 234 fly proteins. 
This means that there has been a dramatic 
expansion not only in the number of C2H2 
transcription factors, but also in the number of 
these DNA-binding motifs per transcription 
factor (8 on average in humans, 3.3 on average 
in the fly, and 2.3 on average in the worm). 
Furthermore, many of these transcription factors
contain either the KRAB or SCAN domains,
which are not found in the fly or worm
genomes. These domains are involved in the 
oligomerization of transcription factors and in- 
crease the combinatorial partnering of these 
factors. In general, most of the transcription 
factor domains are shared between the three 
animal genomes, but the reassortment of these 
domains results in organism-specific transcrip- 
tion factor families. The domain combinations 
found in the human, fly, and worm include the 
BTB with C2H2 in the fly and humans, and 



homeodomains alone or in combination with
Pou and LIM domains in all of the animal
genomes. In plants, however, a different set of 
transcription factors are expanded, namely, the 
myb family, and a unique set that includes VP1 
and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription 
factors compared with the multicellular eu- 
karyotes, and its repertoire is limited to the 
expansion of the yeast-specific C6 transcription 
factor family involved in metabolic regulation.
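
The per-protein averages quoted earlier in this passage for the C2H2 zinc finger family can be recomputed from the Table 18 entries. The text pairs 4500 domains with 564 human proteins, and the table prints that row as 564(4500), so the sketch below assumes that every entry reads as proteins(domains). That reading of the notation, the treatment of a bare number, and the function names `parse_entry` and `domains_per_protein` are assumptions made for illustration.

```python
import re

# Assumed notation: "proteins(domains)". A bare number is taken to mean that
# the domain total equals the protein count (an assumption, not stated in the text).
ENTRY = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_entry(text):
    """Split a table entry such as '564(4500)' into (proteins, domains)."""
    match = ENTRY.match(text)
    if match is None:
        raise ValueError(f"unrecognized table entry: {text!r}")
    proteins = int(match.group(1))
    domains = int(match.group(2)) if match.group(2) else proteins
    return proteins, domains


def domains_per_protein(text):
    """Average number of domains per protein for one table entry."""
    proteins, domains = parse_entry(text)
    if proteins == 0:
        raise ValueError("no proteins in this entry")
    return domains / proteins


# C2H2 zinc finger entries for human, fly, and worm as printed in Table 18.
for organism, entry in [("human", "564(4500)"), ("fly", "234(771)"), ("worm", "68(155)")]:
    print(organism, round(domains_per_protein(entry), 1))
```

Under that reading the three entries give about 8, 3.3, and 2.3 domains per protein, matching the averages stated in the text.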

While we have illustrated expansions in a 
subset of signal transduction molecules in the 
human genome compared with the other eukaryotic
genomes, it should be noted that most of the
protein domains are highly conserved. An
interesting observation is that worms and humans
have approximately the same number of both
tyrosine kinases and serine/threonine kinases
(Table 19). It is important to note, however,
that these are merely counts of the catalytic
domain; the proteins
that contain these domains also display a 
wide repertoire of interaction domains with 
significant combinatorial diversity. 

Hemostasis. Hemostasis is regulated primarily
by plasma proteases of the coagulation pathway
and by the interactions that occur between the
vascular endothelium and platelets. Consistent
with known anatomical and physiological
differences between vertebrates and
invertebrates, extracellular adhesion domains that
constitute proteins integral to hemostasis are
expanded in the human relative to the fly and
worm (Tables 18 and 19). We note the evolution
of domains such as FIMAC, FN1, FN2, and C1q
that mediate surface interactions between
hematopoietic cells and the vascular matrix. In
addition, there has been extensive recruitment
of more-ancient animal-specific domains such as
VWA, VWC, VWD, kringle, and FN3 into
multidomain proteins that are involved in
hemostatic regulation. Although we do not find a
large expansion in the total number of serine
proteases, this enzymatic domain has been
specifically recruited into several of these
multidomain proteins for proteolytic regulation
in the vascular compartment. These are
represented in plasma proteins that belong to
the kinin and complement pathways. There is a
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and
metalloprotease) and MMPs (matrix
metalloproteases) (Table 19). Proteolysis of
extracellular matrix (ECM) proteins is critical
for tissue development and for tissue
degradation in diseases such as cancer,
arthritis, Alzheimer's disease, and a variety of
inflammatory conditions (135, 136). ADAMs are a
family of integral membrane proteins with a
pivotal role in fibrinogenolysis and modulating
interactions between hematopoietic components
and the vascular matrix components. These
proteins have been shown to cleave matrix
proteins, and even signaling molecules: ADAM-17
converts tumor necrosis factor-α, and ADAM-10
has been implicated in the Notch signaling
pathway (135). We have identified 19 members of
the matrix metalloprotease family, and a total
of 51 members of the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of
the apoptotic pathway components across eukarya
is consistent with its central role in
developmental regulation and as a response to
pathogens and stress signals. The signal
transduction pathways involved in programmed
cell death, or apoptosis, are mediated by
interactions between well-characterized domains
that include extracellular domains, adaptor
(protein-protein interaction) domains, and those
found in effector and regulatory enzymes (137).
We enumerated the protein counts of central
adaptor and effector enzyme domains that are
found only in the apoptotic pathways to provide
an estimate of divergence across eukarya and
relative expansion in the human genome when
compared with the fly and worm (Table 18).
Adaptor domains found in proteins restricted
only to apoptotic regulation such as the DED
domains are vertebrate-specific, whereas others
like BIR, CARD, and Bcl2 are represented in the
fly and worm (although the number of Bcl2 family
members in humans is significantly expanded).
Although plants and yeast lack the caspases,
caspase-like molecules, namely the para- and
meta-caspases, have been reported in these
organisms (138). Compared with other animal
genomes, the human genome shows an expansion in
the adaptor and effector domain-containing
proteins involved in apoptosis, as well as in
the proteases involved in the cascade such as
the caspase and calpain families.

Expansions of other protein families. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans)
may be vertebrate-specific. Lipoxygenases are 
involved in arachidonic acid metabolism, and 
they and their activators have been implicated 
alytic activity, as a uracil DNA glycosylase
(140) and functions as a cell cycle regulator
(141) and has even been implicated in apo- 
ptosis (142). 

Translation. Another striking set of hu- 
man expansions has occurred in certain fam- 
ilies involved in the translational machinery. 
We identified 28 different ribosomal subunits 
that each have at least 10 copies in the ge- 
nome; on average, for all ribosomal proteins 
there is about an 8- to 10-fold expansion in 
the number of genes relative to either the 
worm or fly. Retrotransposed pseudogenes 

Table 19 (Continued) 
may account for many of these expansions
[see the discussion above and (143)]. Recent 
evidence suggests that a number of ribosomal 
proteins have secondary functions indepen- 
dent of their involvement in protein biosyn- 
thesis; for example, L13a and the related L7 
subunits (36 copies in humans) have been 
shown to induce apoptosis (144). 

There is also a four- to fivefold expansion 
in the elongation factor 1-alpha family
(eEF1A; 56 human genes). Many of these
expansions likely represent intronless paralogs
that have presumably arisen from
retrotransposition, and again there is evidence
that many of these may be pseudogenes (145).
However, a second form (eEF1A2) of this factor
has been identified with tissue-specific
expression in skeletal muscle and a complementary
expression pattern to the ubiquitously expressed
eEF1A (146).

Ribonucleoproteins. Alternative splicing 
results in multiple transcripts from a single
gene, and can therefore generate additional 
diversity in an organism's protein comple- 
ment. We have identified 269 genes for ri- 
bonucleoproteins. This represents over 2.5 
times the number of ribonucleoprotein genes 
in the worm, two times that of the fly, and 
about the same as the 265 identified in the 
Arabidopsis genome. Whether the diversity 
of ribonucleoprotein genes in humans con- 
tributes to gene regulation at either the splic- 
ing or translational level is unknown. 

Posttranslational modifications. In this 
set of processes, the most prominent expan- 
sion is the transglutaminases, calcium-depen- 
dent enzymes that catalyze the cross-linking 
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K-dependent gamma carboxylase gene prod- 
uct acts on the GLA domain (missing in the 
fly and worm) found in coagulation factors, 
osteocalcin, and matrix GLA protein (148). 
Tyrosylprotein sulfotransferases participate 
in the posttranslational modification of proteins
involved in inflammation and hemostasis,
including coagulation factors and chemokine
receptors (149). Although there is no
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not found in the other currently se- 
quenced genomes. These include the tandem 
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains— BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the prominent differences in 
the immune system, hemostasis, neuronal,
vascular, and cytoskeletal complexity. The
finding that the human genome contains fewer
genes than previously predicted might be
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive do- 
main shuffling to increase or alter combina- 
torial diversity can provide an exponential 
increase in the ability to mediate protein-protein
interactions without dramatically increasing the
absolute size of the protein complement (150).
Evolution of apparently new
(from the perspective of sequence analysis) 
protein domains and increasing regulatory 
complexity by domain accretion both quanti- 
tatively and qualitatively (recruitment of nov- 
el domains with preexisting ones) are two 
features that we observe in humans. Perhaps 
the best illustration of this trend is the C2H2 
zinc finger-containing transcription factors,
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal ribosomal entry sites in the human
genome to regulate translation of specific
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome
(151). At the posttranslational level, although
we provide examples of expansions of some 
protein families involved in these modifica- 
tions, further experimental evidence is re- 
quired to evaluate whether this is correlated 
with increased complexity in protein process- 
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 

8 Conclusions 
8.1 The whole-genome sequencing
approach versus BAC by BAC 

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of 
genome sizes and repeat content allows us to 
assess its strengths and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the 
human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (15, 80, 152) demonstrate that
megabase-sized genomes can be sequenced 
efficiently without any input other than the de
novo mate-paired sequences. With more 
complex genomes like those of Drosophila or 
human, map information, in the form of well- 
ordered markers, has been critical for long- 
range ordering of scaffolds. For joining scaf- 
folds into chromosomes, the quality of the 
map (in terms of the order of the markers) is
more important than the number of markers 
per se. Although this mapping could have 
been performed concurrently with sequenc- 
ing, the prior existence of mapping data was 
beneficial. During the sequencing of the
A. thaliana genome, sequencing of individual
BAC clones permitted extension of the sequence
well into centromeric regions and allowed
high-quality resolution of complex repeat
regions. Likewise, in Drosophila, the
BAC physical map was most useful in re- 
gions near the highly repetitive centromeres 
and telomeres. WGA has been found to de- 
liver excellent-quality reconstructions of the 
unique regions of the genome. As the genome 
size, and more importantly the repetitive con- 
tent, increases, the WGA approach delivers 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by- 
clone approaches makes them difficult to justify 
as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific applica- 
tions of BAC-based or other clone mapping and 
sequencing strategies to resolve ambiguities in 
sequence assembly that cannot be efficiently 
resolved with computational approaches alone 
are clearly worth exploring. Hybrid approaches
to whole-genome sequencing will only work if 
there is sufficient coverage in both the
whole-genome shotgun phase and the BAC clone
sequencing phase. Our experience with human
genome assembly suggests that this will require
at least 3X coverage of both whole-genome and
BAC shotgun sequence data. 
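
The coverage threshold mentioned above can be checked with the usual definition of average coverage (total sequenced bases divided by genome length). The Python sketch below is illustrative only: the function names and all numeric inputs are hypothetical toy values and are not taken from the assembly described here.

```python
def coverage(num_reads, mean_read_length, genome_size):
    """Average sequence coverage: total sequenced bases over genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


def hybrid_is_sufficient(wgs_coverage, bac_coverage, minimum=3.0):
    """True when both data sets reach the minimum coverage quoted in the text."""
    return wgs_coverage >= minimum and bac_coverage >= minimum


# Toy numbers for illustration only; they are not taken from any project.
genome_size = 1_000_000
wgs = coverage(num_reads=6_000, mean_read_length=500, genome_size=genome_size)
bac = coverage(num_reads=5_000, mean_read_length=500, genome_size=genome_size)
print(wgs, bac, hybrid_is_sufficient(wgs, bac))
```

With these toy inputs the whole-genome data reach 3X but the BAC data reach only 2.5X, so the hybrid criterion is not met.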
8.2 The low gene number in humans

We have sequenced and assembled ~95% of
the euchromatic sequence of H. sapiens and
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the 
human genes. This has provided a major sur- 
prise: We have found far fewer genes (26,000 
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever 
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes 
that remain unpredicted owing to the absence 
of EST or protein data to support them, al- 
though our use of mouse genome data for 



predicting genes should limit this number. As 
was true at the beginning of genome sequenc- 
ing, ultimately it will be necessary to measure 
mRNA in specific cell types to demonstrate 
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a 
population of organisms might have to pay a
price for the number of genes it can possibly
carry. He theorized that when the number of
genes becomes too large, each zygote carries
so many new deleterious mutations that the
population simply cannot maintain itself. On
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced
mutations at specific loci, Muller, in 1967
(154), calculated that the mammalian genome
would contain a maximum of not much
more than 30,000 genes (155). An estimate of
30,000 gene loci for humans was also arrived
at by Crow and Kimura (156). Muller's estimate
for D. melanogaster was 10,000 genes,
compared to 13,000 derived by annotation of
the fly genome (26, 27). These arguments for
the theoretical maximum gene number were
based on simplified ideas of genetic load:
that all genes have a certain low rate of
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities 
inherent in human development and the so- 
phisticated signaling systems that maintain 
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as to the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 
of RNA editing in which coding changes 
occur directly at the level of mRNA is of 
clinical and biological relevance (161). Final- 
ly, examples of translational control include 
internal ribosomal entry sites that are found 
in proteins involved in cell cycle regulation 
and apoptosis (162). At the protein level, 
minor alterations in the nature of protein- 
protein interactions, protein modifications 
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human 
genome is asymmetrically populated with 
G+C content, CpG islands, and genes (68). 
However, the genes are not distributed quite 
as unequally as had been predicted (Table 9) 
(69). The most G+C-rich fraction of the ge- 
nome, H3 isochores, constitute more of the 
genome than previously thought (about 9%), 
and are the most gene-dense fraction, but 
contain only 25% of the genes, rather than the 
predicted ~40%. The low G+C L isochores 
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
cation, has been described as the "desertifi- 
cation" of the vertebrate genome (71). Why 
are there clustered regions of high and low 
gene density, and are these accidents of his- 
tory or driven by selection and evolution? If 
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of hu- 
mans; for example, Miniopterus, a species of 
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans. 
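
The isochore figures quoted above can be turned into a rough relative gene density, that is, each fraction's share of the genes divided by its share of the genome. The short sketch below is illustrative only: it uses the rounded percentages from the text (about 9% and 25% for H3, 65% and 48% for L), and the function and variable names are hypothetical.

```python
# Rounded shares quoted in the text (percent of genome, percent of genes).
isochores = {
    "H3 (most G+C-rich)": {"genome_pct": 9, "genes_pct": 25},
    "L (low G+C)": {"genome_pct": 65, "genes_pct": 48},
}


def relative_gene_density(genome_pct, genes_pct):
    """Share of genes divided by share of genome; 1.0 is the genome-wide average."""
    return genes_pct / genome_pct


densities = {
    name: relative_gene_density(**shares) for name, shares in isochores.items()
}

for name, density in densities.items():
    print(f"{name}: {density:.2f}x the genome-wide average gene density")
```

On these rounded inputs the H3 fraction comes out at a little under three times the average gene density and the L fraction at roughly three-quarters of it, which is consistent with the statement that genes are unevenly distributed but less so than had been predicted.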



8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identi- 
fied and mapped more than 3 million SNPs, this 
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplo- 
types present in subjects of different ethnogeo- 
graphic origins, providing insights into popula- 
tion history and migration patterns. Although 
such studies have suggested that modern human 
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and admix- 
ture, SNPs can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated. 

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism— sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo- 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
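
The "one-quarter" figure for the Y chromosome, and the smaller effective population size of the X, follow from simple copy counting. The sketch below assumes equal numbers of females (XX) and males (XY) and treats relative copy number as a stand-in for relative effective population size; both are simplifying assumptions made for illustration, and the names are hypothetical.

```python
from fractions import Fraction

# Copies of each chromosome type carried by one female (XX) and one male (XY).
copies = {
    "autosome": {"female": 2, "male": 2},
    "X": {"female": 2, "male": 1},
    "Y": {"female": 0, "male": 1},
}


def relative_copy_number(kind, females=1, males=1):
    """Copies of `kind` in the population relative to copies of an autosome."""
    def total(chromosome):
        per_sex = copies[chromosome]
        return per_sex["female"] * females + per_sex["male"] * males

    return Fraction(total(kind), total("autosome"))


for kind in ("autosome", "X", "Y"):
    print(kind, relative_copy_number(kind))
```

Under these assumptions the Y is present at one-quarter, and the X at three-quarters, of the autosomal copy number, matching the ordering of polymorphism levels described in the text.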

8.4 Genome complexity 

We will soon be in a position to move away 
from the cataloging of individual compo- 
nents of the system, and beyond the sim- 
plistic notions of "this binds to that, which 
then docks on this, and then the complex 
moves there. . ." (167) to the exciting area 
of network perturbations, nonlinear re- 
sponses and thresholds, and their pivotal 
role in human diseases. 

The enumeration of other "parts lists" re- 
veals that in organisms with complex nervous 
systems, neither gene number, neuron number 
nor number of cell types correlates in any 
meaningful manner with even simplistic mea- 
sures of structural or behavioral complexity. 
Nor would they be expected to; this is the realm 
of nonlinearities and epigenesis (168). The 520 
million neurons of the common octopus exceed 
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and 
human, and from comparative mammalian neu- 
roanatomy (169), that the morphological and 
behavioral diversity found in mammals is un- 
derpinned by a similar gene repertoire and sim- 
ilar neuroanatomies. For example, when one 
compares a pygmy marmoset (which is only 4 
inches tall and weighs about 6 ounces) to a 
chimpanzee, the brain volume of this minute 
primate is found to be only about 1.5 cm³, two 
orders of magnitude less than that of a chimp 
and three orders less than that of humans. Yet 
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic or- 
ganizations, and cell types and neuroanatomies 
are almost indistinguishable, yet the develop- 
mental modifications that predisposed human 
lineages to cortical expansion and development 
of the larynx, giving rise to language, culminat- 
ed in a massive singularity that by even the 
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, friz- 
zled, TGF-β, ephrins, and connexins), as well 
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
conclusion that Einstein's brain was more 
complex than that of Drosophila, closer com- 
parisons such as whether the set of predicted 
human proteins is more complex than the 
protein set of Drosophila, and if so, to what 
degree, are not straightforward, since protein, 
protein domain, or protein-protein interaction 
measures do not capture context-dependent 
interactions that underpin the dynamics un- 
derlying phenotype. 

Currently, there are more than 30 different 
mathematical descriptions of complexity (170). 
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal popula- 
tions), is through graph theory (171). The ele- 
ments of the system can be represented by the 
vertices of complex topographies, with the edg- 
es representing the interactions between them. 
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not 
due to redundancy, but is a property of inho- 
mogeneously wired networks. The error toler- 
ance of such networks comes with a price: they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an 
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious pheno- 
typic effects (172), and yet the usually conspic- 
uous vimentin network is completely absent. 
On the other hand, ~30% of knockouts in 
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
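
The contrast drawn above between error tolerance and vulnerability to the loss of a few key nodes can be illustrated with a deliberately small toy network. The sketch below is not a model of any real gene network: it assumes a hypothetical layout of four hubs joined in a ring, each hub with ten low-connectivity neighbors, and it uses removal of the lowest-degree nodes as a stand-in for a typical "minor" knockout. All names are illustrative.

```python
from collections import deque


def build_network(n_hubs=4, leaves_per_hub=10):
    """Return an adjacency dict: hubs joined in a ring, each with its own leaves."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for h in range(n_hubs):
        link(f"hub{h}", f"hub{(h + 1) % n_hubs}")
        for i in range(leaves_per_hub):
            link(f"hub{h}", f"leaf{h}_{i}")
    return adj


def largest_component(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    removed = set(removed)
    seen, best = set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in removed and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def knockout(adj, k, target_hubs):
    """Pick k nodes to delete: highest-degree first if target_hubs, else lowest."""
    ranked = sorted(adj, key=lambda n: (len(adj[n]), n), reverse=target_hubs)
    return ranked[:k]


network = build_network()
minor = largest_component(network, knockout(network, 2, target_hubs=False))
severe = largest_component(network, knockout(network, 2, target_hubs=True))
print(f"{len(network)} nodes; after 2 peripheral knockouts: {minor}; "
      f"after 2 hub knockouts: {severe}")
```

In this toy case, deleting two peripheral nodes leaves 42 of the 44 nodes connected, whereas deleting two hubs cuts the largest connected piece to 22. The same number of knockouts therefore has very different consequences depending on where in the wiring they fall, which is the point made in the text about critical nodes.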

It has been predicted for the last 15 years 
that complete sequencing of the human ge- 
nome would open up new strategies for hu- 
man biological research and would have a 
major impact on medicine, and through med- 
icine and public health, on society. Effects on 
biomedical research are already being felt. 
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It 
has been possible only because of innova- 
tions in instrumentation and software that 
have allowed automation of almost every step 
of the process from DNA preparation to an- 
notation. The next steps are clear: We must 
define the complexity that ensues when this 
relatively modest set of about 30,000 genes is 
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- 
istry, physiology, and ultimately phenotype 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc- 
tionism, the view that with complete knowl- 
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
defined similarity to each other, only that they 



www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 

share at least one significant BLAST hit in common. 
This is an especially interesting property of the 
metric, because it allows the rapid recovery of pro- 
tein families from the proteome for which no mul- 
tiple alignment is possible, thus providing a compu- 
tational basis for the extension of protein homology 
searches beyond those of current HMM- and profile- 
based search methods. Once the whole-proteome 
similarity matrix has been calculated, Lek first par- 
titions the proteome into single-linkage clusters 
(27) on the basis of one or more shared BLAST hits 
between two sequences. Next, these single-linkage 
clusters are further partitioned into subclusters, 
each member of which shares a user-specified pair- 
wise similarity with the other members of the clus- 
ter, as described above. For the purposes of this 
publication, we have focused on the analysis of 
single-linkage clusters and what we have termed 
"complete clusters," e.g., those subclusters for 
which every member has a similarity metric of 1 to 
every other member of the subcluster. We believe 
that the single-linkage and complete clusters are of 
special interest, in part, because they allow us to 
estimate and to compare sizes of core protein sets 
in a rigorous manner. The rationale for this is as 
follows: If one imagines for a moment a perfect 
clustering algorithm capable of perfectly partition- 
ing one or more perfectly annotated protein sets 
into protein families, it is reasonable to assume that 
the number of clusters will always be greater than, 
or equal to, the number of single-linkage clusters, 
because single-linkage clustering is a maximally ag- 
glomerative clustering method. Thus, if there exists 
a single protein in the predicted protein set contain- 
ing domains A and B, then it will be clustered by 
single linkage together with all single-domain pro- 
teins containing domains A or B. Likewise, for a 
predicted protein set containing a single multido- 
main protein, the number of real clusters must 
always be less than or equal to the number of 
complete clusters, because it is impossible to place 
a unique multidomain protein into a complete clus- 
ter. Thus, the single-linkage and complete clusters 
plus singletons should comprise a lower and upper 
bound of sizes of core protein sets, respectively, 
allowing us to compare the relative size and com- 
plexity of different organisms' predicted protein set 
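
The lower-bound argument for single-linkage clusters can be made concrete with a small sketch. This is not the Lek implementation, whose code and similarity metric are not reproduced here; it assumes only the rule stated above, that two proteins fall in the same single-linkage cluster when they are connected by a chain of shared BLAST hits. The protein and hit names are hypothetical.

```python
def single_linkage_clusters(hits):
    """Group proteins connected by any chain of shared BLAST hits.

    `hits` maps each protein name to the set of database hits it produced.
    """
    parent = {protein: protein for protein in hits}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    first_owner = {}  # first protein seen for each hit
    for protein, protein_hits in hits.items():
        for hit in protein_hits:
            if hit in first_owner:
                parent[find(protein)] = find(first_owner[hit])
            else:
                first_owner[hit] = protein

    clusters = {}
    for protein in hits:
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical protein set: two domain-A proteins, two domain-B proteins,
# and one unrelated protein.
single_domain = {
    "A1": {"domA"}, "A2": {"domA"},
    "B1": {"domB"}, "B2": {"domB"},
    "C1": {"domC"},
}
without_ab = single_linkage_clusters(single_domain)

# Adding one protein that carries both domains merges the A and B families.
with_ab = single_linkage_clusters({**single_domain, "AB": {"domA", "domB"}})

print(len(without_ab), without_ab)
print(len(with_ab), with_ab)
```

In this example a single two-domain protein reduces the cluster count from three to two by pulling the A and B families together. This is the sense in which single-linkage clustering is maximally agglomerative, and why its cluster count serves as a lower bound on the number of true families.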

A historic moment for the scientific endeavor. 



EXHIBIT M 

THE HUMAN 
GENOME 

Humanity has been given a great gift. With the completion of the human 
genome sequence, we have received a powerful tool for unlocking the 
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera 
Genomics. The report of the sequencing of the human genome from the 
publicly funded consortium of laboratories led by Francis Collins appears 
in this week's Nature. This stunning achievement has been portrayed— 
often unfairly— as a competition between two 
ventures, one public and one private. That characterization detracts from 
the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago reflected, and now rewards, 
the confidence of those who believe that the pursuit of large-scale funda- 
mental problems in the life sciences is in the national interest. The technical 
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was believed possible. 
Thus we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and 
private entrepreneurship. 

There are excellent scientific reasons for applauding an outcome that 
has given us two winners. Two sequences are better than one; the opportunity for comparison and con- 
vergence is invaluable. Indeed, a real-world proof of the [illegible] 
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. [illegible] 

Although we have made the point before, it is worth repeating that the sequencing of the human 
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in 
his Viewpoint (p. 257), the knowledge that all of the genetic components of any [illegible] 
[illegible] will give extraordinary new power to scientists. Because of this breakthrough, research 
[illegible] the effects of individual genes to a more integrated view that examines 
whole ensembles of genes as they interact to form a living human being. Several articles in this issue 
[illegible] is already beginning to revolutionize the way we look at human disease. 
This has been a massive project, on a scale unparalleled in the history of biology; but of course 
it has built on the scientific insights of centuries [illegible] 
announcement falls during the week of the anniversary of the birth of Charles Darwin. [illegible] 
message that the survival of a species can depend on its ability to evolve in the face of change is 
[illegible] pertinent to discussions that have gone on in the past year over access to the Celera [illegible] 
[illegible] information regarding the agreements that were reached to make the data available can be 

found [illegible] 

[illegible] other than the traditional GenBank, while insisting on access to all the [illegible] 

[illegible] are producing more and more potentially valuable sequences, yet (at least in 
[illegible]) databases provide scant protection against piracy. Had the Celera data been kept se- 
cret, it would have been a serious loss to the scientific community. We hope that our [illegible] 
[illegible] will enable other proprietary data to be published after peer review, in a way that 

[illegible] stunning, and so carefully watched [illegible] created 
new challenges for the scientific venture. Science is proud to have played a role in [illegible] 
[illegible] onto the public stage. It is literally true that this is a historic moment for the scientific en- 
deavor. [illegible] genome has been called the Book of Life. Rather, it is [illegible] 
rules that encourage exploration and reward creativity, we can find many of the books that will 
help define us and our place in the great tapestry of life. [illegible] and Donald Kennedy 




EXHIBIT N 

BLAST of SEQ ID NO:23 versus Human genome 



MEGABLAST 1.2.3-Paracel [2001-11-20] 
Reference: 

Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), 
"A greedy algorithm for aligning DNA sequences", 
J Comput Biol 2000; 7(1-2):203-14. 
Database: Homo_sapiens.latestgp.masked.fa 

44,521 sequences; 2 00,768,834,160 total letters 
Query= seqid23 

(3660 letters) 

                                                    Score     E
Sequences producing significant alignments:         (bits)  Value

AC007600.5.1.183083                                    505  e-140
AC096996.1.1.194627                                    454  e-124

>AC007600.5.1.183083 

Length = 183083 

Score = 505 bits (255), Expect = e-140 
Identities = 255/255 (100%) 
Strand = Plus / Plus 

Query: 1355  agaagtttttcctccaggagagccctgttttctatgtccagacattacaagaccccagca 1414
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46566 agaagtttttcctccaggagagccctgttttctatgtccagacattacaagaccccagca 46625

Query: 1415  aagctctggtctttgaggaggccaccttgtcatggcaacagacctgtcccgggatcgtca 1474
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46626 aagctctggtctttgaggaggccaccttgtcatggcaacagacctgtcccgggatcgtca 46685

Query: 1475  atggggcactggagctggagaggaacgggcatgcttctgaggggatgaccaggcctagag 1534
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46686 atggggcactggagctggagaggaacgggcatgcttctgaggggatgaccaggcctagag 46745

Query: 1535  atgccctcgggccagaggaagaagggaacagcctgggcccagagttgcacaagatcaacc 1594
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46746 atgccctcgggccagaggaagaagggaacagcctgggcccagagttgcacaagatcaacc 46805

Query: 1595  tggtggtgtccaagg 1609
             |||||||||||||||
Sbjct: 46806 tggtggtgtccaagg 46820



Score = 458 bits (231), Expect = e-125 
Identities = 234/235 (99%) 
Strand = Plus / Plus 

Query: 544   atattgattataccaaagatcctggaatattcagaagagcagttggggaatgttgtccat 603
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 34936 atattgattataccaaagatcctggaatattcagaagagcagttggggaatgttgtccat 34995

Query: 604   ggagtgggactctgctttgccctttttctctccgaatgtgtgaagtctctgagtttctcc 663
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 34996 ggagtgggactctgctttgccctttttctctccgaatgtgtgaagtctctgagtttctcc 35055

Query: 664   tccagttggatcatcaaccaacgcacagccatcaggttccaagcagctgtttcctccttt 723
             |||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||
Sbjct: 35056 tccagttggatcatcaaccaacgcacagccatcaggttccgagcagctgtttcctccttt 35115

Query: 724   gcctttgagaagctcatccaatttaagtctgtaatacacatcacctcaggagagg 778
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35116 gcctttgagaagctcatccaatttaagtctgtaatacacatcacctcaggagagg 35170



Score = 454 bits (229), Expect = e-124 
Identities = 229/229 (100%) 
Strand = Plus / Plus 

Query: 2217  ggttttccgctgccccatgagtttctttgacaccatcccaataggccggcttttgaactg 2276
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 70337 ggttttccgctgccccatgagtttctttgacaccatcccaataggccggcttttgaactg 70396

Query: 2277  cttcgcaggggacttggaacagctggaccagctcttgcccatcttttcagagcagttcct 2336
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 70397 cttcgcaggggacttggaacagctggaccagctcttgcccatcttttcagagcagttcct 70456

Query: 2337  ggtcctgtccttaatggtgatcgccgtcctgttgattgtcagtgtgctgtctccatatat 2396
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 70457 ggtcctgtccttaatggtgatcgccgtcctgttgattgtcagtgtgctgtctccatatat 70516

Query: 2397  cctgttaatgggagccataatcatggttatttgcttcatttattatatg 2445
             |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 70517 cctgttaatgggagccataatcatggttatttgcttcatttattatatg 70565



Score = 408 bits (206), Expect = e-110 
Identities = 206/206 (100%) 
Strand = Plus / Plus 

Query: 1877  agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 1936
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57285 agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 57344

Query: 1937  gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 1996
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57345 gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 57404

Query: 1997  cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 2056
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57405 cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 57464

Query: 2057  tcgtcctggtgacccaccagctgcag 2082
             ||||||||||||||||||||||||||
Sbjct: 57465 tcgtcctggtgacccaccagctgcag 57490



Score = 385 bits (194), Expect = e-103 
Identities = 194/194 (100%) 
Strand = Plus / Plus 

Query: 2857  aagatgtgtgtctcggaagctcctttacacatggaaggcacaagttgtccccaggggtgg 2916
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 80649 aagatgtgtgtctcggaagctcctttacacatggaaggcacaagttgtccccaggggtgg 80708

Query: 2917  ccacagcatggggaaatcatatttcaggattatcacatgaaatacagagacaacacaccc 2976
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 80709 ccacagcatggggaaatcatatttcaggattatcacatgaaatacagagacaacacaccc 80768

Query: 2977  accgtgcttcacggcatcaacctgaccatccgcggccacgaagtggtgggcatcgtggga 3036
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 80769 accgtgcttcacggcatcaacctgaccatccgcggccacgaagtggtgggcatcgtggga 80828

Query: 3037  aggacgggctctgg 3050
             ||||||||||||||
Sbjct: 80829 aggacgggctctgg 80842



Score = 371 bits (187), Expect = 2e-99 
Identities = 187/187 (100%) 
Strand = Plus / Plus 

Query: 2583  gtttaagaggctgactgatgcgcagaataactacctgctgttgtttctatcttccacacg 2642
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 73139 gtttaagaggctgactgatgcgcagaataactacctgctgttgtttctatcttccacacg 73198

Query: 2643  atggatggcattgaggctggagatcatgaccaaccttgtgaccttggctgttgccctgtt 2702
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 73199 atggatggcattgaggctggagatcatgaccaaccttgtgaccttggctgttgccctgtt 73258

Query: 2703  cgtggcttttggcatttcctccaccccctactcctttaaagtcatggctgtcaacatcgt 2762
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 73259 cgtggcttttggcatttcctccaccccctactcctttaaagtcatggctgtcaacatcgt 73318

Query: 2763  gctgcag 2769
             |||||||
Sbjct: 73319 gctgcag 73325



Score = 347 bits (175), Expect = 3e-92 
Identities = 178/179 (99%) 
Strand = Plus / Plus 

Query: 776   aggccatcagcttcttcaccggtgatgtaaactacctgtttgaaggggtgtgctatggac 835
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 41478 aggccatcagcttcttcaccggtgatgtaaactacctgtttgaaggggtgtgctatggac 41537

Query: 836   ccctagtactgatcacctgcgcatcgctggtcatctgcagcatttcttcctacttcatta 895
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 41538 ccctagtactgatcacctgcgcatcgctggtcatctgcagcatttcttcctacttcatta 41597

Query: 896   ttggatacactgcatttattgccatcttatgctatctcctggttttcccactggaggta 954
             |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||
Sbjct: 41598 ttggatacactgcatttattgccatcttatgctatctcctggttttcccactggcggta 41656



Score = 337 bits (170), Expect = 3e-89 
Identities = 170/170 (100%) 
Strand = Plus / Plus 

Query: 3401  agatcatccttatcgatgaagccacagcctccattgacatggagacagacaccctgatcc 3460
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90103 agatcatccttatcgatgaagccacagcctccattgacatggagacagacaccctgatcc 90162

Query: 3461  agcgcacaatccgtgaagccttccagggctgcaccgtgctcgtcattgcccaccgtgtca 3520
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90163 agcgcacaatccgtgaagccttccagggctgcaccgtgctcgtcattgcccaccgtgtca 90222

Query: 3521  ccactgtgctgaactgtgaccacatcctggttatgggcaatgggaaggtg 3570
             ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90223 ccactgtgctgaactgtgaccacatcctggttatgggcaatgggaaggtg 90272



Score = 321 bits (162), Expect = 2e-84 
Identities = 162/162 (100%) 
Strand = Plus / Plus 

Query: 235   aggtttcctgccccccagcccctggacaatgctggcctgttctcctacctcaccgtgtca 294
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 29801 aggtttcctgccccccagcccctggacaatgctggcctgttctcctacctcaccgtgtca 29860

Query: 295   tggctcaccccgctcatgatccaaagcttacggagtcgcttagatgagaacaccatccct 354
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 29861 tggctcaccccgctcatgatccaaagcttacggagtcgcttagatgagaacaccatccct 29920

Query: 355   ccactgtcagtccatgatgcctcagacaaaaatgtccaaagg 396
             ||||||||||||||||||||||||||||||||||||||||||
Sbjct: 29921 ccactgtcagtccatgatgcctcagacaaaaatgtccaaagg 29962



Score = 319 bits (161), Expect = 7e-84 
Identities = 161/161 (100%) 
Strand = Plus / Plus 

Query: 3049  gggaagtcctccttgggcatggctctcttccgcctggtggagcccatggcaggccggatt 3108
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 82347 gggaagtcctccttgggcatggctctcttccgcctggtggagcccatggcaggccggatt 82406

Query: 3109  ctcattgacggcgtggacatttgcagcatcggcctggaggacttgcggtccaagctctca 3168
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 82407 ctcattgacggcgtggacatttgcagcatcggcctggaggacttgcggtccaagctctca 82466

Query: 3169  gtgatccctcaagatccagtgctgctctcaggaaccatcag 3209
             |||||||||||||||||||||||||||||||||||||||||
Sbjct: 82467 gtgatccctcaagatccagtgctgctctcaggaaccatcag 82507



Score = 299 bits (151), Expect = 6e-78 
Identities = 151/151 (100%) 



Strand = Plus / Plus 



Query: 393 aaggcttcaccgcctttgggaagaagaagtctcaaggcgagggattgaaaaagcttcagt 452 

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 33335 aaggcttcaccgcctttgggaagaagaagtctcaaggcgagggattgaaaaagcttcagt 33394 
Query: 453 gcttctggtgatgctgaggttccagagaacaaggttgattttcgatgcacttctgggcat 512 

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 33395 gcttctggtgatgctgaggttccagagaacaaggttgattttcgatgcacttctgggcat 33454



Query: 513 ctgcttctgcattgccagtgtactcgggcca 543 

           |||||||||||||||||||||||||||||||

Sbjct: 33455 ctgcttctgcattgccagtgtactcgggcca 33485 



Score = 297 bits (150), Expect = 3e-77 
Identities = 150/150 (100%) 
Strand = Plus / Plus 



Query: 950 aggtattcatgacaagaatggctgtgaaggctcagcatcacacatctgaggtcagcgacc 1009 

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 42421 aggtattcatgacaagaatggctgtgaaggctcagcatcacacatctgaggtcagcgacc 42480 
Query: 1010 agcgcatccgtgtgaccagtgaagttctcacttgcattaagctgattaaaatgtacacat 1069

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 42481 agcgcatccgtgtgaccagtgaagttctcacttgcattaagctgattaaaatgtacacat 42540 
Query: 1070 gggagaaaccatttgcaaaaatcattgaag 1099 

            ||||||||||||||||||||||||||||||

Sbjct: 42541 gggagaaaccatttgcaaaaatcattgaag 42570 



Score = 293 bits (148), Expect = 4e-76 
Identities = 151/152 (99%) 
Strand = Plus / Plus 



Query: 1098 agacctaagaaggaaggaaaggaagctattggagaagtgcgggcttgtccagagcctgac 1157 

            |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||

Sbjct: 42736 agacctaagaaggaaggaaaggaaactattggagaagtgcgggcttgtccagagcctgac 42795 



Query: 1158 aagtataaccttgttcatcatccccacagtggccacagcggtctgggttctcatccacac 1217

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 42796 aagtataaccttgttcatcatccccacagtggccacagcggtctgggttctcatccacac 42855 
Query: 1218 atccttaaagctgaaactcacagcgtcaatgg 1249 

            ||||||||||||||||||||||||||||||||

Sbjct: 42856 atccttaaagctgaaactcacagcgtcaatgg 42887 



Score = 289 bits (146), Expect = 6e-75 
Identities = 188/202 (93%) 
Strand = Plus / Plus 



Query: 1877 agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 1936 

            ||||||| |||||||||||||||||||||||||||||||  |||||||| ||||||||||



Sbjct: 142145 agattggggagcggggcctcaacctctctggggggcagaggcagaggattagcctggccc 142204 
Query: 1937 gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 1996

            |||| ||||| |||||||||||| ||||||||||||||||||||||||| || |||||||

Sbjct: 142205 gcgctgtctactccgaccgtcagctctacctgctggacgaccccctgtcggccgtggacg 142264
Query: 1997 cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 2056 

            ||||||||||||||||| | ||||||||||||||||||||||| |||||||| ||||| |

Sbjct: 142265 cccacgtggggaagcacgtctttgaggagtgcattaagaagacgctcaggggaaagacag 142324
Query: 2057 tcgtcctggtgacccaccagct 2078

            ||||||||||||||||||||||

Sbjct: 142325 tcgtcctggtgacccaccagct 142346 



Score = 278 bits (140), Expect = 2e-71 
Identities = 140/140 (100%) 
Strand = Plus / Plus 

Query: 2445 gatgttcaagaaggccatcggtgtgttcaagagactggagaactatagccggtctccttt 2504

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 70675 gatgttcaagaaggccatcggtgtgttcaagagactggagaactatagccggtctccttt 70734 
Query: 2505 attctcccacatcctcaattctctgcaaggcctgagctccatccatgtctatggaaaaac 2564 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 70735 attctcccacatcctcaattctctgcaaggcctgagctccatccatgtctatggaaaaac 70794 
Query: 2565 tgaagacttcatcagccagt 2584 

            ||||||||||||||||||||

Sbjct: 70795 tgaagacttcatcagccagt 70814 



Score = 278 bits (140), Expect = 2e-71
Identities = 140/140 (100%) 
Strand = Plus / Plus 

Query: 2080 cagtacttagaattttgtggccagatcattttgttggaaaatgggaaaatctgtgaaaat 2139 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 59486 cagtacttagaattttgtggccagatcattttgttggaaaatgggaaaatctgtgaaaat 59545 
Query: 2140 ggaactcacagtgagttaatgcagaaaaaggggaaatatgcccaacttatccagaagatg 2199

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 59546 ggaactcacagtgagttaatgcagaaaaaggggaaatatgcccaacttatccagaagatg 59605 
Query: 2200 cacaaggaagccacttcggt 2219

            ||||||||||||||||||||

Sbjct: 59606 cacaaggaagccacttcggt 59625 



Score = 276 bits (139), Expect = 9e-71 
Identities = 139/139 (100%) 
Strand = Plus / Plus 



Query: 100 aaaacctatactctccaagatggcccctggagtcagcaagagagaaatcctgaggctcca 159 

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



Sbjct: 27194 aaaacctatactctccaagatggcccctggagtcagcaagagagaaatcctgaggctcca 27253 



Query: 160 gggagggcagctgtcccaccgtgggggaagtatgatgctgccttgagaaccatgattccc 219 

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 27254 gggagggcagctgtcccaccgtgggggaagtatgatgctgccttgagaaccatgattccc 27313 
Query: 220 ttccgtcccaagccgaggt 238 

           |||||||||||||||||||

Sbjct: 27314 ttccgtcccaagccgaggt 27332



Score = 252 bits (127), Expect = 1e-63
Identities = 127/127 (100%) 
Strand = Plus / Plus 



Query: 1679 agatgcacttgctcgagggctcggtgggggtgcagggaagcctggcctatgtcccccagc 1738

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 52228 agatgcacttgctcgagggctcggtgggggtgcagggaagcctggcctatgtcccccagc 52287



Query: 1739 aggcctggatcgtcagcgggaacatcagggagaacatcctcatgggaggcgcatatgaca 1798 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 52288 aggcctggatcgtcagcgggaacatcagggagaacatcctcatgggaggcgcatatgaca 52347 
Query: 1799 aggcccg 1805 

            |||||||

Sbjct: 52348 aggcccg 52354 

Score = 226 bits (114), Expect = 8e-56 
Identities = 114/114 (100%) 
Strand = Plus / Plus 



Query: 3289 atctcaaagttccccaaaaagctgcatacagatgtggtggaaaacggtggaaacttctct 3348 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 87547 atctcaaagttccccaaaaagctgcatacagatgtggtggaaaacggtggaaacttctct 87606 
Query: 3349 gtgggggagaggcagctgctctgcattgccagggctgtgcttcgcaactccaag 3402 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 87607 gtgggggagaggcagctgctctgcattgccagggctgtgcttcgcaactccaag 87660 



Score = 216 bits (109), Expect = 7e-53
Identities = 109/109 (100%) 
Strand = Plus / Plus 



Query: 1248 ggccttcagcatgctggcctccttgaatctccttcggctgtcagtgttctttgtgcctat 1307 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 44216 ggccttcagcatgctggcctccttgaatctccttcggctgtcagtgttctttgtgcctat 44275 
Query: 1308 tgcagtcaaaggtctcacgaattccaagtctgcagtgatgaggttcaag 1356 

            |||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 44276 tgcagtcaaaggtctcacgaattccaagtctgcagtgatgaggttcaag 44324 



Score = 196 bits (99), Expect = 7e-47
Identities = 99/99 (100%)
Strand = Plus / Plus 

Query: 1 atgactaggaagaggacatactgggtgcccaactcttctggtggcctcgtgaatcgtggc 60 

         ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 25846 atgactaggaagaggacatactgggtgcccaactcttctggtggcctcgtgaatcgtggc 25905 
Query: 61 atcgacataggcgatgacatggtttcaggacttatttat 99 

          |||||||||||||||||||||||||||||||||||||||

Sbjct: 25906 atcgacataggcgatgacatggtttcaggacttatttat 25944 



Score = 188 bits (95), Expect = 2e-44 
Identities = 95/95 (100%) 
Strand = Plus / Plus 

Query: 3566 aggtggtagaatttgatcggccggaggtactgcggaagaagcctgggtcattgttcgcag 3625 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 90397 aggtggtagaatttgatcggccggaggtactgcggaagaagcctgggtcattgttcgcag 90456
Query: 3626 ccctcatggccacagccacttcttcactgagataa 3660 

            |||||||||||||||||||||||||||||||||||

Sbjct: 90457 ccctcatggccacagccacttcttcactgagataa 90491 



Score = 182 bits (92), Expect = 1e-42
Identities = 92/92 (100%)
Strand = Plus / Plus

Query: 2768 agctggcgtccagcttccaggccactgcccggattggcttggagacagaggcacagttca 2827 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 79077 agctggcgtccagcttccaggccactgcccggattggcttggagacagaggcacagttca 79136 
Query: 2828 cggctgtagagaggatactgcagtacatgaag 2859

            ||||||||||||||||||||||||||||||||

Sbjct: 79137 cggctgtagagaggatactgcagtacatgaag 79168 



Score = 161 bits (81), Expect = 4e-36 
Identities = 81/81 (100%) 
Strand = Plus / Plus 

Query: 3208 agattcaacctagatccctttgaccgtcacactgaccagcagatctgggatgccttggag 3267 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 86796 agattcaacctagatccctttgaccgtcacactgaccagcagatctgggatgccttggag 86855 
Query: 3268 aggacattcctgaccaaggcc 3288 

            |||||||||||||||||||||

Sbjct: 86856 aggacattcctgaccaaggcc 86876 



Score = 147 bits (74), Expect = 6e-32 
Identities = 74/74 (100%) 
Strand = Plus / Plus 



Query: 1805 gatacctccaggtgctccactgctgctccctgaatcgggacctggaacttctgccctttg 1864 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 54466 gatacctccaggtgctccactgctgctccctgaatcgggacctggaacttctgccctttg 54525
Query: 1865 gagacatgacagag 1878 

            ||||||||||||||

Sbjct: 54526 gagacatgacagag 54539 



Score = 147 bits (74), Expect = 6e-32 
Identities = 74/74 (100%) 
Strand = Plus / Plus 

Query: 1607 aggggatgatgttaggggtctgcggcaacacggggagtggtaagagcagcctgttgtcag 1666 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 49269 aggggatgatgttaggggtctgcggcaacacggggagtggtaagagcagcctgttgtcag 49328 



Query: 1667 ccatcctggaggag 1680 

            ||||||||||||||

Sbjct: 49329 ccatcctggaggag 49342 



>AC096996.1.1.194627

Length = 194627 

Score = 454 bits (229), Expect = e-124 
Identities = 229/229 (100%) 
Strand = Plus / Minus 

Query: 2217 ggttttccgctgccccatgagtttctttgacaccatcccaataggccggcttttgaactg 2276

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 173330 ggttttccgctgccccatgagtttctttgacaccatcccaataggccggcttttgaactg 173271 
Query: 2277 cttcgcaggggacttggaacagctggaccagctcttgcccatcttttcagagcagttcct 2336 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 173270 cttcgcaggggacttggaacagctggaccagctcttgcccatcttttcagagcagttcct 173211 
Query: 2337 ggtcctgtccttaatggtgatcgccgtcctgttgattgtcagtgtgctgtctccatatat 2396

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 173210 ggtcctgtccttaatggtgatcgccgtcctgttgattgtcagtgtgctgtctccatatat 173151 
Query: 2397 cctgttaatgggagccataatcatggttatttgcttcatttattatatg 2445

            |||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 173150 cctgttaatgggagccataatcatggttatttgcttcatttattatatg 173102 



Score = 408 bits (206), Expect = e-110 
Identities = 206/206 (100%) 
Strand = Plus / Minus 

Query: 1877 agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 1936

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 186382 agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 186323 
Query: 1937 gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 1996 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 186322 gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 186263



Query: 1997 cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 2056 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 186262 cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 186203 
Query: 2057 tcgtcctggtgacccaccagctgcag 2082

            ||||||||||||||||||||||||||

Sbjct: 186202 tcgtcctggtgacccaccagctgcag 186177 



Score = 385 bits (194), Expect = e-103
Identities = 194/194 (100%)
Strand = Plus / Minus

Query: 2857 aagatgtgtgtctcggaagctcctttacacatggaaggcacaagttgtccccaggggtgg 2916 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 163018 aagatgtgtgtctcggaagctcctttacacatggaaggcacaagttgtccccaggggtgg 162959 
Query: 2917 ccacagcatggggaaatcatatttcaggattatcacatgaaatacagagacaacacaccc 2976 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 162958 ccacagcatggggaaatcatatttcaggattatcacatgaaatacagagacaacacaccc 162899 
Query: 2977 accgtgcttcacggcatcaacctgaccatccgcggccacgaagtggtgggcatcgtggga 3036 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 162898 accgtgcttcacggcatcaacctgaccatccgcggccacgaagtggtgggcatcgtggga 162839 
Query: 3037 aggacgggctctgg 3050

            ||||||||||||||

Sbjct: 162838 aggacgggctctgg 162825 



Score = 371 bits (187), Expect = 2e-99 
Identities = 187/187 (100%) 
Strand = Plus / Minus 

Query: 2583 gtttaagaggctgactgatgcgcagaataactacctgctgttgtttctatcttccacacg 2642 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 170528 gtttaagaggctgactgatgcgcagaataactacctgctgttgtttctatcttccacacg 170469
Query: 2643 atggatggcattgaggctggagatcatgaccaaccttgtgaccttggctgttgccctgtt 2702 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 170468 atggatggcattgaggctggagatcatgaccaaccttgtgaccttggctgttgccctgtt 170409
Query: 2703 cgtggcttttggcatttcctccaccccctactcctttaaagtcatggctgtcaacatcgt 2762 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 170408 cgtggcttttggcatttcctccaccccctactcctttaaagtcatggctgtcaacatcgt 170349

Query: 2763 gctgcag 2769 
            |||||||

Sbjct: 170348 gctgcag 170342 



Score = 337 bits (170), Expect = 3e-89
Identities = 170/170 (100%)
Strand = Plus / Minus



Query: 3401 agatcatccttatcgatgaagccacagcctccattgacatggagacagacaccctgatcc 3460 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 153564 agatcatccttatcgatgaagccacagcctccattgacatggagacagacaccctgatcc 153505
Query: 3461 agcgcacaatccgtgaagccttccagggctgcaccgtgctcgtcattgcccaccgtgtca 3520 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 153504 agcgcacaatccgtgaagccttccagggctgcaccgtgctcgtcattgcccaccgtgtca 153445 
Query: 3521 ccactgtgctgaactgtgaccacatcctggttatgggcaatgggaaggtg 3570 

            ||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 153444 ccactgtgctgaactgtgaccacatcctggttatgggcaatgggaaggtg 153395 



Score = 319 bits (161), Expect = 7e-84 
Identities = 161/161 (100%) 
Strand = Plus / Minus 



Query: 3049 gggaagtcctccttgggcatggctctcttccgcctggtggagcccatggcaggccggatt 3108 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 161320 gggaagtcctccttgggcatggctctcttccgcctggtggagcccatggcaggccggatt 161261 
Query: 3109 ctcattgacggcgtggacatttgcagcatcggcctggaggacttgcggtccaagctctca 3168

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 161260 ctcattgacggcgtggacatttgcagcatcggcctggaggacttgcggtccaagctctca 161201 
Query: 3169 gtgatccctcaagatccagtgctgctctcaggaaccatcag 3209 

            |||||||||||||||||||||||||||||||||||||||||

Sbjct: 161200 gtgatccctcaagatccagtgctgctctcaggaaccatcag 161160



Score = 289 bits (146), Expect = 6e-75 
Identities = 188/202 (93%) 
Strand = Plus / Minus 



Query: 1877 agattggagagcggggcctcaacctctctggggggcagaaacagaggatcagcctggccc 1936

            ||||||| |||||||||||||||||||||||||||||||  |||||||| ||||||||||

Sbjct: 101522 agattggggagcggggcctcaacctctctggggggcagaggcagaggattagcctggccc 101463 
Query: 1937 gcgccgtctattccgaccgtcagatctacctgctggacgaccccctgtctgctgtggacg 1996 

            |||| ||||| |||||||||||| ||||||||||||||||||||||||| || |||||||

Sbjct: 101462 gcgctgtctactccgaccgtcagctctacctgctggacgaccccctgtcggccgtggacg 101403 



Query: 1997 cccacgtggggaagcacatttttgaggagtgcattaagaagacactcagggggaagacgg 2056 

            ||||||||||||||||| | ||||||||||||||||||||||| |||||||| ||||| |

Sbjct: 101402 cccacgtggggaagcacgtctttgaggagtgcattaagaagacgctcaggggaaagacag 101343 
Query: 2057 tcgtcctggtgacccaccagct 2078

            ||||||||||||||||||||||

Sbjct: 101342 tcgtcctggtgacccaccagct 101321 



Score = 278 bits (140), Expect = 2e-71 
Identities = 140/140 (100%)
Strand = Plus / Minus
Query: 2080 cagtacttagaattttgtggccagatcattttgttggaaaatgggaaaatctgtgaaaat 2139 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 184181 cagtacttagaattttgtggccagatcattttgttggaaaatgggaaaatctgtgaaaat 184122 
Query: 2140 ggaactcacagtgagttaatgcagaaaaaggggaaatatgcccaacttatccagaagatg 2199 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 184121 ggaactcacagtgagttaatgcagaaaaaggggaaatatgcccaacttatccagaagatg 184062
Query: 2200 cacaaggaagccacttcggt 2219 

            ||||||||||||||||||||

Sbjct: 184061 cacaaggaagccacttcggt 184042 



Score = 278 bits (140), Expect = 2e-71 
Identities = 140/140 (100%) 
Strand = Plus / Minus 



Query: 2445 gatgttcaagaaggccatcggtgtgttcaagagactggagaactatagccggtctccttt 2504

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 172992 gatgttcaagaaggccatcggtgtgttcaagagactggagaactatagccggtctccttt 172933 
Query: 2505 attctcccacatcctcaattctctgcaaggcctgagctccatccatgtctatggaaaaac 2564 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 172932 attctcccacatcctcaattctctgcaaggcctgagctccatccatgtctatggaaaaac 172873 
Query: 2565 tgaagacttcatcagccagt 2584

            ||||||||||||||||||||

Sbjct: 172872 tgaagacttcatcagccagt 172853



Score = 252 bits (127), Expect = 1e-63
Identities = 127/127 (100%) 
Strand = Plus / Minus 

Query: 1679 agatgcacttgctcgagggctcggtgggggtgcagggaagcctggcctatgtcccccagc 1738 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 191439 agatgcacttgctcgagggctcggtgggggtgcagggaagcctggcctatgtcccccagc 191380 
Query: 1739 aggcctggatcgtcagcgggaacatcagggagaacatcctcatgggaggcgcatatgaca 1798 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 191379 aggcctggatcgtcagcgggaacatcagggagaacatcctcatgggaggcgcatatgaca 191320 
Query: 1799 aggcccg 1805

            |||||||

Sbjct: 191319 aggcccg 191313 



Score = 226 bits (114), Expect = 8e-56
Identities = 114/114 (100%) 
Strand = Plus / Minus 



Query: 3289 atctcaaagttccccaaaaagctgcatacagatgtggtggaaaacggtggaaacttctct 3348 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 156120 atctcaaagttccccaaaaagctgcatacagatgtggtggaaaacggtggaaacttctct 156061
Query: 3349 gtgggggagaggcagctgctctgcattgccagggctgtgcttcgcaactccaag 3402 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 156060 gtgggggagaggcagctgctctgcattgccagggctgtgcttcgcaactccaag 156007 

Score = 188 bits (95), Expect = 2e-44
Identities = 95/95 (100%) 
Strand = Plus / Minus 

Query: 3566 aggtggtagaatttgatcggccggaggtactgcggaagaagcctgggtcattgttcgcag 3625

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 153270 aggtggtagaatttgatcggccggaggtactgcggaagaagcctgggtcattgttcgcag 153211 



Query: 3626 ccctcatggccacagccacttcttcactgagataa 3660 

            |||||||||||||||||||||||||||||||||||

Sbjct: 153210 ccctcatggccacagccacttcttcactgagataa 153176 



Score = 182 bits (92), Expect = 1e-42
Identities = 92/92 (100%) 
Strand = Plus / Minus 

Query: 2768 agctggcgtccagcttccaggccactgcccggattggcttggagacagaggcacagttca 2827 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 164590 agctggcgtccagcttccaggccactgcccggattggcttggagacagaggcacagttca 164531 
Query: 2828 cggctgtagagaggatactgcagtacatgaag 2859 

            ||||||||||||||||||||||||||||||||

Sbjct: 164530 cggctgtagagaggatactgcagtacatgaag 164499



Score = 161 bits (81), Expect = 4e-36 
Identities = 81/81 (100%) 
Strand = Plus / Minus 



Query: 3208 agattcaacctagatccctttgaccgtcacactgaccagcagatctgggatgccttggag 3267 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 156871 agattcaacctagatccctttgaccgtcacactgaccagcagatctgggatgccttggag 156812 
Query: 3268 aggacattcctgaccaaggcc 3288 

            |||||||||||||||||||||

Sbjct: 156811 aggacattcctgaccaaggcc 156791 



Score = 147 bits (74), Expect = 6e-32
Identities = 74/74 (100%) 
Strand = Plus / Minus 



Query: 1805 gatacctccaggtgctccactgctgctccctgaatcgggacctggaacttctgccctttg 1864 

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 189201 gatacctccaggtgctccactgctgctccctgaatcgggacctggaacttctgccctttg 189142 



Query: 1865 gagacatgacagag 1878 



            ||||||||||||||

Sbjct: 189141 gagacatgacagag 189128 



Score = 147 bits (74), Expect = 6e-32 
Identities = 74/74 (100%) 
Strand = Plus / Minus 

Query: 1607 aggggatgatgttaggggtctgcggcaacacggggagtggtaagagcagcctgttgtcag 1666

            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 194398 aggggatgatgttaggggtctgcggcaacacggggagtggtaagagcagcctgttgtcag 194339 
Query: 1667 ccatcctggaggag 1680 

            ||||||||||||||

Sbjct: 194338 ccatcctggaggag 194325 
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